Thread: BUG #18203: Logical Replication initial sync failure

BUG #18203: Logical Replication initial sync failure

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18203
Logged by:          Justin G
Email address:      zzzzz.graf@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 20
Description:

We’ve found an edge case that breaks logical replication.

Procedure:

alter system set wal_level to 'logical' ;

create database source;  --pg 11
create database destin; --pg 15

create table no_col(); -- pg 11

insert into no_col default values;
insert into no_col default values;
insert into no_col default values;
insert into no_col default values;
insert into no_col default values;

CREATE PUBLICATION no_col_pub for table no_col;

create table no_col(); --pg 15

CREATE SUBSCRIPTION no_col_sub --pg15
    CONNECTION
        'host=<hostname/ip>
        port=5432
        user=<set username>
        dbname=source
        password=<set the password>'                
    PUBLICATION no_col_pub
    WITH (
        CREATE_SLOT    = TRUE,
        ENABLED        = TRUE,
        COPY_DATA    = TRUE
        );


We now have a table with 5 null rows. This is a valid table, pg_dump will
dump it and restore it. count(*) will return 5.

The dump file produced by pg_dump version 15 generates this SQL
COPY public.no_col  FROM stdin;

However, the copy command from PG 15 using logical replication includes (),
which normally has a list of columns between the parentheses. 

COPY public.no_col () TO STDOUT

Logical replication worker returns with the following error: 
2023-11-17 20:40:16.141 UTC [151084] ERROR:  could not start initial
contents copy for table "public.no_col": ERROR:  syntax error at or near
")"
    LINE 1: COPY public.no_col () TO STDOUT

If the direction of replication is reversed from PG15 to PG14, the above
error is not thrown, and the 5 rows are copied over. 

Version tested
Publishers  PG 11.14, 14.2, 15.2
Subscriber PG 14.2, 15.2, , RDS 14.7  
The error has been observed in 15.2 and in RDS PostgreSQL 14.7.


Re: BUG #18203: Logical Replication initial sync failure

From
vignesh C
Date:
On Sat, 18 Nov 2023 at 18:10, PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18203
> Logged by:          Justin G
> Email address:      zzzzz.graf@gmail.com
> PostgreSQL version: 15.2
> Operating system:   Ubuntu 20
> Description:
>
> We’ve found an edge case that breaks logical replication.
>
> Procedure:
>
> alter system set wal_level to 'logical' ;
>
> create database source;  --pg 11
> create database destin; --pg 15
>
> create table no_col(); -- pg 11
>
> insert into no_col default values;
> insert into no_col default values;
> insert into no_col default values;
> insert into no_col default values;
> insert into no_col default values;
>
> CREATE PUBLICATION no_col_pub for table no_col;
>
> create table no_col(); --pg 15
>
> CREATE SUBSCRIPTION no_col_sub --pg15
>     CONNECTION
>         'host=<hostname/ip>
>         port=5432
>         user=<set username>
>         dbname=source
>         password=<set the password>'
>     PUBLICATION no_col_pub
>     WITH (
>         CREATE_SLOT    = TRUE,
>         ENABLED        = TRUE,
>         COPY_DATA    = TRUE
>         );
>
>
> We now have a table with 5 null rows. This is a valid table, pg_dump will
> dump it and restore it. count(*) will return 5.
>
> The dump file produced by pg_dump version 15 generates this SQL
> COPY public.no_col  FROM stdin;
>
> However, the copy command from PG 15 using logical replication includes (),
> which normally has a list of columns between the parentheses.
>
> COPY public.no_col () TO STDOUT
>
> Logical replication worker returns with the following error:
> 2023-11-17 20:40:16.141 UTC [151084] ERROR:  could not start initial
> contents copy for table "public.no_col": ERROR:  syntax error at or near
> ")"
>     LINE 1: COPY public.no_col () TO STDOUT

Thanks for reporting this issue, I was able to reproduce the issue.
This issue is happening because we are trying to specify the column
list while the table sync worker is copying data for the table.
I felt we should not specify the column list when the table has no columns.
Attached patch has the changes to handle the same.

Regards,
Vignesh

Attachment

Re: BUG #18203: Logical Replication initial sync failure

From
Amit Kapila
Date:
On Mon, Nov 20, 2023 at 2:53 PM vignesh C <vignesh21@gmail.com> wrote:
>
> >
> > Logical replication worker returns with the following error:
> > 2023-11-17 20:40:16.141 UTC [151084] ERROR:  could not start initial
> > contents copy for table "public.no_col": ERROR:  syntax error at or near
> > ")"
> >     LINE 1: COPY public.no_col () TO STDOUT
>
> Thanks for reporting this issue, I was able to reproduce the issue.
> This issue is happening because we are trying to specify the column
> list while the table sync worker is copying data for the table.
> I felt we should not specify the column list when the table has no columns.
> Attached patch has the changes to handle the same.
>

I think we can add a test for such cases which tests both initial sync
and regular replication on a table with no column.

--
With Regards,
Amit Kapila.



Re: BUG #18203: Logical Replication initial sync failure

From
Peter Smith
Date:
On Mon, Nov 20, 2023 at 8:23 PM vignesh C <vignesh21@gmail.com> wrote:
>
...
> Thanks for reporting this issue, I was able to reproduce the issue.
> This issue is happening because we are trying to specify the column
> list while the table sync worker is copying data for the table.
> I felt we should not specify the column list when the table has no columns.
> Attached patch has the changes to handle the same.
>

Hi Vignesh,

One small comment about the patch:

- appendStringInfo(&cmd, "COPY %s (",
+ appendStringInfo(&cmd, "COPY %s ",
  quote_qualified_identifier(lrel.nspname, lrel.relname));

- /*
- * XXX Do we need to list the columns in all cases? Maybe we're
- * replicating all columns?
- */
- for (int i = 0; i < lrel.natts; i++)
+ /* If the table has columns, then specify the columns */
+ if (lrel.natts)
  {
- if (i > 0)
- appendStringInfoString(&cmd, ", ");
+ appendStringInfoString(&cmd, "(");

For consistent whitespace handling, and to prevent double-spacing when
there are no columns, I think that the trailing space of "COPY %s "
should be removed/replaced by a leading space for the "(".

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: BUG #18203: Logical Replication initial sync failure

From
vignesh C
Date:
On Mon, 20 Nov 2023 at 18:38, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 20, 2023 at 2:53 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > >
> > > Logical replication worker returns with the following error:
> > > 2023-11-17 20:40:16.141 UTC [151084] ERROR:  could not start initial
> > > contents copy for table "public.no_col": ERROR:  syntax error at or near
> > > ")"
> > >     LINE 1: COPY public.no_col () TO STDOUT
> >
> > Thanks for reporting this issue, I was able to reproduce the issue.
> > This issue is happening because we are trying to specify the column
> > list while the table sync worker is copying data for the table.
> > I felt we should not specify the column list when the table has no columns.
> > Attached patch has the changes to handle the same.
> >
>
> I think we can add a test for such cases which tests both initial sync
> and regular replication on a table with no column.

I have added a test for the same and tested upto PG15 where the
changes were added.
The patches for the same are attached.

Regards,
Vignesh

Attachment

Re: BUG #18203: Logical Replication initial sync failure

From
vignesh C
Date:
On Tue, 21 Nov 2023 at 03:38, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Nov 20, 2023 at 8:23 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> ...
> > Thanks for reporting this issue, I was able to reproduce the issue.
> > This issue is happening because we are trying to specify the column
> > list while the table sync worker is copying data for the table.
> > I felt we should not specify the column list when the table has no columns.
> > Attached patch has the changes to handle the same.
> >
>
> Hi Vignesh,
>
> One small comment about the patch:
>
> - appendStringInfo(&cmd, "COPY %s (",
> + appendStringInfo(&cmd, "COPY %s ",
>   quote_qualified_identifier(lrel.nspname, lrel.relname));
>
> - /*
> - * XXX Do we need to list the columns in all cases? Maybe we're
> - * replicating all columns?
> - */
> - for (int i = 0; i < lrel.natts; i++)
> + /* If the table has columns, then specify the columns */
> + if (lrel.natts)
>   {
> - if (i > 0)
> - appendStringInfoString(&cmd, ", ");
> + appendStringInfoString(&cmd, "(");
>
> For consistent whitespace handling, and to prevent double-spacing when
> there are no columns, I think that the trailing space of "COPY %s "
> should be removed/replaced by a leading space for the "(".

This is fixed in the v2 version patch attached at:
https://www.postgresql.org/message-id/CALDaNm2kSJCVree6R1iKxjbA%3DSg4raFi3Fi_%2BwfF2xn2roxqCg%40mail.gmail.com

Regards,
Vignesh



Re: BUG #18203: Logical Replication initial sync failure

From
Peter Smith
Date:
The v2 patch LGTM except for the commit message.

a) wording - "for tables have no columns too"
b) the plural of "parenthesis" is "parentheses"

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: BUG #18203: Logical Replication initial sync failure

From
vignesh C
Date:
On Tue, 21 Nov 2023 at 13:31, Peter Smith <smithpb2250@gmail.com> wrote:
>
> The v2 patch LGTM except for the commit message.
>
> a) wording - "for tables have no columns too"
> b) the plural of "parenthesis" is "parentheses"

Thanks, these are fixed in the attached v3 version patch.

Regards,
Vignesh

Attachment