Thread: [BUGS] BUG #14785: Logical replication does not work after adding a column.Bug?

The following bug has been logged on the website:

Bug reference:      14785
Logged by:          yxq
Email address:      yxq@o2.pl
PostgreSQL version: 10beta3
Operating system:   Linux Debian Stretch 64-bit
Description:

Server Master:port = 5432wal_level = logical

Server Slave:port = 5433wal_level = logical



***** Steps to reproduce:


* Master:
CREATE TABLE users(    id SERIAL PRIMARY KEY,    firstname VARCHAR(100) NOT NULL,    lastname VARCHAR(100) NOT
NULL);INSERTINTO users(firstname,lastname) VALUES('fu1','lu1');INSERT INTO users(firstname,lastname)
VALUES('fu2','lu2');SELECT* FROM users;
 

result:
id | firstname | lastname 
----+-----------+---------- 1 | fu1       | lu1 2 | fu2       | lu2
(2 rows)   

   CREATE PUBLICATION mypub FOR ALL TABLES;




*** Slave:
CREATE TABLE users(    id SERIAL PRIMARY KEY,    firstname VARCHAR(100) NOT NULL,    lastname VARCHAR(100) NOT NULL);
CREATE SUBSCRIPTION mysub         CONNECTION 'port=5432 dbname=postgres'        PUBLICATION mypub;SELECT * FROM users;

result:
id | firstname | lastname 
----+-----------+---------- 1 | fu1       | lu1 2 | fu2       | lu2
(2 rows)   




*** Master:
INSERT INTO users(firstname,lastname) VALUES('fu3','lu3');
SELECT * FROM users;

result:
id | firstname | lastname 
----+-----------+---------- 1 | fu1       | lu1 2 | fu2       | lu2 3 | fu3       | lu3
(3 rows)   




*** Slave:
SELECT * FROM users;

result:
id | firstname | lastname 
----+-----------+---------- 1 | fu1       | lu1 2 | fu2       | lu2 3 | fu3       | lu3
(3 rows)   

ALTER TABLE users ADD enabled BOOLEAN NOT NULL DEFAULT true;



*** Master:
ALTER TABLE users ADD enabled BOOLEAN NOT NULL DEFAULT true;
INSERT INTO users(firstname,lastname) VALUES('fu4','lu4');
SELECT * FROM users;

result:
id | firstname | lastname | enabled 
----+-----------+----------+--------- 1 | fu1       | lu1      | t 2 | fu2       | lu2      | t 3 | fu3       | lu3
| t 4 | fu4       | lu4      | t
 
(4 rows)




*** Slave:
SELECT * FROM users;

result:
id | firstname | lastname | enabled 
----+-----------+----------+--------- 1 | fu1       | lu1      | t 2 | fu2       | lu2      | t 3 | fu3       | lu3
| t
 
(3 rows)




*** Slave Log:

2017-08-20 18:25:59.430 UTC [4764] LOG:  database system is ready to accept
connections
2017-08-20 18:27:09.938 UTC [4781] LOG:  logical replication apply worker
for subscription "mysub" has started
2017-08-20 18:27:09.947 UTC [4783] LOG:  logical replication table
synchronization worker for subscription "mysub", table "users" has started
2017-08-20 18:27:10.354 UTC [4783] LOG:  logical replication table
synchronization worker for subscription "mysub", table "users" has
finished
2017-08-20 18:42:46.121 UTC [4781] ERROR:  logical replication target
relation "public.pg_temp_16386" does not exist
2017-08-20 18:42:46.122 UTC [4764] LOG:  worker process: logical replication
worker for subscription 16392 (PID 4781) exited with exit code 1
2017-08-20 18:42:46.124 UTC [4837] LOG:  logical replication apply worker
for subscription "mysub" has started
2017-08-20 18:42:46.154 UTC [4837] ERROR:  logical replication target
relation "public.pg_temp_16386" does not exist
2017-08-20 18:42:46.155 UTC [4764] LOG:  worker process: logical replication
worker for subscription 16392 (PID 4837) exited with exit code 1
2017-08-20 18:42:51.163 UTC [4839] LOG:  logical replication apply worker
for subscription "mysub" has started
2017-08-20 18:42:51.221 UTC [4839] ERROR:  logical replication target
relation "public.pg_temp_16386" does not exist
2017-08-20 18:42:51.223 UTC [4764] LOG:  worker process: logical replication
worker for subscription 16392 (PID 4839) exited with exit code 1
2017-08-20 18:42:56.234 UTC [4841] LOG:  logical replication apply worker
for subscription "mysub" has started
2017-08-20 18:42:56.378 UTC [4841] ERROR:  logical replication target
relation "public.pg_temp_16386" does not exist
2017-08-20 18:42:56.380 UTC [4764] LOG:  worker process: logical replication
worker for subscription 16392 (PID 4841) exited with exit code 1
2017-08-20 18:43:01.390 UTC [4845] LOG:  logical replication apply worker
for subscription "mysub" has started
2017-08-20 18:43:01.568 UTC [4845] ERROR:  logical replication target
relation "public.pg_temp_16386" does not exist
2017-08-20 18:43:01.570 UTC [4764] LOG:  worker process: logical replication
worker for subscription 16392 (PID 4845) exited with exit code 1
2017-08-20 18:43:06.580 UTC [4847] LOG:  logical replication apply worker
for subscription "mysub" has started
2017-08-20 18:43:06.726 UTC [4847] ERROR:  logical replication target
relation "public.pg_temp_16386" does not exist
2017-08-20 18:43:06.729 UTC [4764] LOG:  worker process: logical replication
worker for subscription 16392 (PID 4847) exited with exit code 1
2017-08-20 18:43:11.739 UTC [4849] LOG:  logical replication apply worker
for subscription "mysub" has started
2017-08-20 18:43:11.906 UTC [4849] ERROR:  logical replication target
relation "public.pg_temp_16386" does not exist
2017-08-20 18:43:11.908 UTC [4764] LOG:  worker process: logical replication
worker for subscription 16392 (PID 4849) exited with exit code 1
2017-08-20 18:43:16.918 UTC [4851] LOG:  logical replication apply worker
for subscription "mysub" has started
2017-08-20 18:43:17.077 UTC [4851] ERROR:  logical replication target
relation "public.pg_temp_16386" does not exist
2017-08-20 18:43:17.079 UTC [4764] LOG:  worker process: logical replication
worker for subscription 16392 (PID 4851) exited with exit code 1




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14785: Logical replication does not work after addinga column. Bug?

From
Peter Eisentraut
Date:
When you add a column on the publication side, you also need to add it
on the subscription side, otherwise there is nowhere to put the data.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Hi,

On 2017-08-22 15:22:41 -0400, Peter Eisentraut wrote:
> When you add a column on the publication side, you also need to add it
> on the subscription side, otherwise there is nowhere to put the data.

Op's sql shows that that's done. The problem is the table rewrite not
being handled nicely by logical decoding / replication.

- Andres


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14785: Logical replication does not work after addinga column. Bug?

From
Peter Eisentraut
Date:
On 8/22/17 15:34, Andres Freund wrote:
> On 2017-08-22 15:22:41 -0400, Peter Eisentraut wrote:
>> When you add a column on the publication side, you also need to add it
>> on the subscription side, otherwise there is nowhere to put the data.
> 
> Op's sql shows that that's done. The problem is the table rewrite not
> being handled nicely by logical decoding / replication.

OK, I see it now.

The problem happens on the publisher side.  After the table rewrite on
the publisher side, the output plugin starts sending the wrong relname.

The name comes straight from RelationGetRelationName() (in
logicalrep_write_rel()) with the Relation that gets passed into the
pgoutput_change() callback, so there doesn't appear to be an obvious
logic error in the output plugin.

Any ideas?  Do we need to renew a snapshot somehow, perhaps?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

On 2017-08-23 09:25:13 -0400, Peter Eisentraut wrote:
> On 8/22/17 15:34, Andres Freund wrote:
> > On 2017-08-22 15:22:41 -0400, Peter Eisentraut wrote:
> >> When you add a column on the publication side, you also need to add it
> >> on the subscription side, otherwise there is nowhere to put the data.
> > 
> > Op's sql shows that that's done. The problem is the table rewrite not
> > being handled nicely by logical decoding / replication.
> 
> OK, I see it now.
> 
> The problem happens on the publisher side.  After the table rewrite on
> the publisher side, the output plugin starts sending the wrong relname.
> 
> The name comes straight from RelationGetRelationName() (in
> logicalrep_write_rel()) with the Relation that gets passed into the
> pgoutput_change() callback, so there doesn't appear to be an obvious
> logic error in the output plugin.
> 
> Any ideas?  Do we need to renew a snapshot somehow, perhaps?

This is "known" behaviour - this is the actual data WAL logged :(. Table
rewrites generate these pg_temp* tables and log the data there...

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14785: Logical replication does not work after addinga column. Bug?

From
Peter Eisentraut
Date:
On 8/23/17 12:31, Andres Freund wrote:
> This is "known" behaviour - this is the actual data WAL logged :(. Table
> rewrites generate these pg_temp* tables and log the data there...

Hmm, I see.

Possibly, one way a user could recover from this is to add the column on
the subscriber, rename to table on the subscriber to the temp name, then
wait until all the changes from the rewrite are applied, at which point
it should start complaining in the logs that the original table name
does not exist, then rename the table back.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

On 2017-08-23 17:35:11 -0400, Peter Eisentraut wrote:
> On 8/23/17 12:31, Andres Freund wrote:
> > This is "known" behaviour - this is the actual data WAL logged :(. Table
> > rewrites generate these pg_temp* tables and log the data there...
> 
> Hmm, I see.
> 
> Possibly, one way a user could recover from this is to add the column on
> the subscriber, rename to table on the subscriber to the temp name, then
> wait until all the changes from the rewrite are applied, at which point
> it should start complaining in the logs that the original table name
> does not exist, then rename the table back.

I think we could actually kind of solve this by just ignoring pg_temp*
tables in the output plugin. Given we don't do DDL replication at this
point, that seems good enough. "all" we need is a way to make sure we're
not confusing the pg_temp* tables with a table the user has declared as
pg_temp - but we could check subscription state for that?

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14785: Logical replication does not work after addinga column. Bug?

From
Peter Eisentraut
Date:
On 8/23/17 17:39, Andres Freund wrote:
> I think we could actually kind of solve this by just ignoring pg_temp*
> tables in the output plugin. Given we don't do DDL replication at this
> point, that seems good enough. "all" we need is a way to make sure we're
> not confusing the pg_temp* tables with a table the user has declared as
> pg_temp - but we could check subscription state for that?

How about this patch?  There is some heuristic element in there, but it
seems better than failing as we currently do.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

Re: [BUGS] BUG #14785: Logical replication does not work after addinga column. Bug?

From
Peter Eisentraut
Date:
On 9/13/17 09:46, Peter Eisentraut wrote:
> On 8/23/17 17:39, Andres Freund wrote:
>> I think we could actually kind of solve this by just ignoring pg_temp*
>> tables in the output plugin. Given we don't do DDL replication at this
>> point, that seems good enough. "all" we need is a way to make sure we're
>> not confusing the pg_temp* tables with a table the user has declared as
>> pg_temp - but we could check subscription state for that?
> 
> How about this patch?  There is some heuristic element in there, but it
> seems better than failing as we currently do.

Thoughts on this patch?  Might be good to include in PG10.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14785: Logical replication does not work afteradding a column. Bug?

From
Andres Freund
Date:
Hi,

On 2017-09-13 09:46:43 -0400, Peter Eisentraut wrote:
> The prevent this problem, we filter out any tables that match this
> naming pattern and match an actual table from FOR ALL TABLES
> publications.  This is only a heuristic, meaning that user tables that
> match that naming could accidentally be omitted.  A more robust solution
> might require an explicit marking of such tables in pg_class somehow.

Yes, I think that makes sense.

> +            /*
> +             * Skip tables that look like they are from a heap rewrite (see
> +             * make_new_heap()).  We need to skip them because the subscriber
> +             * won't have a table by that name to receive the data.  That
> +             * means we won't ship the new data in, say, an added column with
> +             * a DEFAULT, but if the user applies the same DDL manually on the
> +             * subscriber, then this will work out for them.
> +             *
> +             * We only need to consider the alltables case, because such a
> +             * transient heap won't be an explicit member of a publication.
> +             */
> +            if (pub->alltables)
> +            {
> +                char       *relname = get_rel_name(relid);
> +                unsigned int u;
> +
> +                if (sscanf(relname, "pg_temp_%u", &u) == 1)
> +                {
> +                    if (get_rel_relkind(u) == RELKIND_RELATION)
> +                        break;
> +                }
> +            }
> +

This'll accept tablenames like pg_temp_1foo, right? Might be worth
being a bit narrower in the test.

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14785: Logical replication does not work after addinga column. Bug?

From
Peter Eisentraut
Date:
On 9/25/17 15:16, Andres Freund wrote:
> This'll accept tablenames like pg_temp_1foo, right? Might be worth
> being a bit narrower in the test.

Committed with that change.  Thanks.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 9/25/17 15:16, Andres Freund wrote:
>> This'll accept tablenames like pg_temp_1foo, right? Might be worth
>> being a bit narrower in the test.

> Committed with that change.  Thanks.

This patch is using the wrong approach entirely.  Every other place in
the backend that is trying to exclude temp relations uses a test on the
containing namespace, not the relname.  You should be using
RELATION_IS_OTHER_TEMP(), or if there's good reason to decide this
without opening the relation, maybe you could get away with
isOtherTempNamespace(get_rel_namespace(relid)).

I also note that as committed, the patch will dump core on a
concurrently-dropped relation, because get_rel_name returns NULL
under such circumstances.

BTW, get_rel_sync_entry has some other serious problems: it is not being
at all careful about whether persistent data structures are left in sane
states if it gets an error partway through.  In particular it'll leave
behind a new hash entry in entirely-unknown state, and if LoadPublications
gets an error, it will also leave a time bomb behind in the form of
not nil, but already-list-freed, data->publications.  And I sure do not
understand why a single static variable publications_valid is being used
to remember validity of data->publications ... couldn't there be more
than one of those?
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14785: Logical replication does not work afteradding a column. Bug?

From
Stephen Frost
Date:
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> > On 9/25/17 15:16, Andres Freund wrote:
> >> This'll accept tablenames like pg_temp_1foo, right? Might be worth
> >> being a bit narrower in the test.
>
> > Committed with that change.  Thanks.
>
> This patch is using the wrong approach entirely.  Every other place in
> the backend that is trying to exclude temp relations uses a test on the
> containing namespace, not the relname.

The specific issue here is that the new pg_class entry is created in the
same namespace, not in the temp one.  The commit mentions make_new_heap()
specifically because that's where the issue is coming from because
that's creating this new pg_temp_XXX table in the regular user
namespace.

I'm not a huge fan of this approach either, really, but I'm not sure
that there's a better answer either.

> I also note that as committed, the patch will dump core on a
> concurrently-dropped relation, because get_rel_name returns NULL
> under such circumstances.

That's certainly no good and should be checked for.

> BTW, get_rel_sync_entry has some other serious problems: it is not being
> at all careful about whether persistent data structures are left in sane
> states if it gets an error partway through.  In particular it'll leave
> behind a new hash entry in entirely-unknown state, and if LoadPublications
> gets an error, it will also leave a time bomb behind in the form of
> not nil, but already-list-freed, data->publications.  And I sure do not
> understand why a single static variable publications_valid is being used
> to remember validity of data->publications ... couldn't there be more
> than one of those?

This all certainly doesn't sound good, but I'm not as familiar with
these bits.

Thanks!

Stephen

Re: [BUGS] BUG #14785: Logical replication does not work afteradding a column. Bug?

From
Stephen Frost
Date:
Tom,

* Stephen Frost (sfrost@snowman.net) wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> > > On 9/25/17 15:16, Andres Freund wrote:
> > >> This'll accept tablenames like pg_temp_1foo, right? Might be worth
> > >> being a bit narrower in the test.
> >
> > > Committed with that change.  Thanks.
> >
> > This patch is using the wrong approach entirely.  Every other place in
> > the backend that is trying to exclude temp relations uses a test on the
> > containing namespace, not the relname.
>
> The specific issue here is that the new pg_class entry is created in the
> same namespace, not in the temp one.  The commit mentions make_new_heap()
> specifically because that's where the issue is coming from because
> that's creating this new pg_temp_XXX table in the regular user
> namespace.

And, to be clear, the new table also doesn't necessairly have
RELPERSISTENCE_TEMP either.

Really is pretty ugly how make_new_heap() operates, it seems...

Thanks!

Stephen

Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> This patch is using the wrong approach entirely.  Every other place in
>> the backend that is trying to exclude temp relations uses a test on the
>> containing namespace, not the relname.

> The specific issue here is that the new pg_class entry is created in the
> same namespace, not in the temp one.  The commit mentions make_new_heap()
> specifically because that's where the issue is coming from because
> that's creating this new pg_temp_XXX table in the regular user
> namespace.

Um ... but that transient relation should never be visible to any other
backend, because we delete it before committing.  If there is someplace
where this is not the case, then we need to change that, because it'll
break everything else that scans pg_class, such as VACUUM.

In short: this code is unlike every other similar check in the backend.
I find it unlikely that this is right and all the others are wrong.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14785: Logical replication does not work afteradding a column. Bug?

From
Andres Freund
Date:
On 2017-09-26 14:17:21 -0400, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> This patch is using the wrong approach entirely.  Every other place in
> >> the backend that is trying to exclude temp relations uses a test on the
> >> containing namespace, not the relname.
> 
> > The specific issue here is that the new pg_class entry is created in the
> > same namespace, not in the temp one.  The commit mentions make_new_heap()
> > specifically because that's where the issue is coming from because
> > that's creating this new pg_temp_XXX table in the regular user
> > namespace.
> 
> Um ... but that transient relation should never be visible to any other
> backend, because we delete it before committing.  If there is someplace
> where this is not the case, then we need to change that, because it'll
> break everything else that scans pg_class, such as VACUUM.

This is accessed during logical decoding, for filtering purposes.
Because outputting all those rewritten rows for a temp relation isn't
helpful, rather to to the contrary.

Obviously it'd be better if we'd some nicer infrastructure, but that
seems like something for v11+.

> In short: this code is unlike every other similar check in the backend.
> I find it unlikely that this is right and all the others are wrong.

I don't think those checks are comparable.

Greetings,

Andres Freund


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs