Thread: BUG #18558: ALTER PUBLICATION fails with unhelpful error on attempt to use system column

The following bug has been logged on the website:

Bug reference:      18558
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 17beta2
Operating system:   Ubuntu 22.04
Description:

The following script:
CREATE TABLE t(a int);
CREATE PUBLICATION p FOR TABLE t(a);

ALTER PUBLICATION p SET TABLE t (a, ctid);
triggers
ERROR:  negative bitmapset member not allowed

Whilst:
CREATE PUBLICATION p FOR TABLE t(a, ctid);
ends up with a more informative
ERROR:  cannot use system column "ctid" in publication column list

Reproduced on REL_15_STABLE .. master.


On Sat, Jul 27, 2024 at 11:15 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      18558
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 17beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> The following script:
> CREATE TABLE t(a int);
> CREATE PUBLICATION p FOR TABLE t(a);
>
> ALTER PUBLICATION p SET TABLE t (a, ctid);
> triggers
> ERROR:  negative bitmapset member not allowed
>
> Whilst:
> CREATE PUBLICATION p FOR TABLE t(a, ctid);
> ends up with a more informative
> ERROR:  cannot use system column "ctid" in publication column list
>
> Reproduced on REL_15_STABLE .. master.
>

Thank you for reporting the inconsistent/unfriendly error message.

~~~

The good message::
The good error message comes from publication_translate_columns()
called by publication_add_relation(). This is why the good message
happens for CREATE PUBLICATION. So, you will also find that ALTER
PUBLICATION .. ADD TABLE would give this same good error message:

test_pub=# CREATE TABLE t2(a int);
CREATE TABLE
test_pub=# ALTER PUBLICATION p ADD TABLE t2(a, ctid);
ERROR:  cannot use system column "ctid" in publication column list

~~~

The bad message:
OTOH, "ALTER PUBLICATION ... SET TABLE" uses different logic; it first
builds a BitMapSet (BMS) for the old and new column lists to compare
what has changed. No validation is done here before building the new
BMS so it results in the low-level (not user-friendly) error message
caused by the "ctid" column.

~~~

My fix:
I feel the ALTER ... SET and CREATE PUBLICATION the same column list
validation logic. But instead of cut/pasting that validation checking
from publication_translate_columns(), attached is a diff patch that
exposes the publication_translate_columns() so we can just call that
same function.

Result:
Now all these scenarios will produce the same good error, below.

test_pub=# CREATE TABLE t(a int);
CREATE TABLE

test_pub=# CREATE PUBLICATION p FOR TABLE t(a, cid);
2024-07-29 11:30:16.809 AEST [2281] ERROR:  column "cid" of relation
"t" does not exist
2024-07-29 11:30:16.809 AEST [2281] STATEMENT:  CREATE PUBLICATION p
FOR TABLE t(a, cid);
ERROR:  column "cid" of relation "t" does not exist

test_pub=# CREATE PUBLICATION p FOR TABLE t(a);
CREATE PUBLICATION

test_pub=# ALTER PUBLICATION p SET TABLE t(a, ctid);
2024-07-29 11:30:36.579 AEST [2281] ERROR:  cannot use system column
"ctid" in publication column list
2024-07-29 11:30:36.579 AEST [2281] STATEMENT:  ALTER PUBLICATION p
SET TABLE t(a, ctid);
ERROR:  cannot use system column "ctid" in publication column list

~~~

If this is deemed an acceptable fix, then I will improve on it (e.g.
IMO publication_translate_columns can modified to return the BMS), and
I will also add the necessary test cases.

======
Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachment
Peter Smith <smithpb2250@gmail.com> writes:
> On Sat, Jul 27, 2024 at 11:15 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> The following script:
>> CREATE TABLE t(a int);
>> CREATE PUBLICATION p FOR TABLE t(a);
>> 
>> ALTER PUBLICATION p SET TABLE t (a, ctid);
>> triggers
>> ERROR:  negative bitmapset member not allowed

> My fix:
> I feel the ALTER ... SET and CREATE PUBLICATION the same column list
> validation logic. But instead of cut/pasting that validation checking
> from publication_translate_columns(), attached is a diff patch that
> exposes the publication_translate_columns() so we can just call that
> same function.

Agreed on that, but shouldn't this patch also be removing some code
from the ALTER ... SET path?  Or is that part of the cleanup you
handwaved about?

> If this is deemed an acceptable fix, then I will improve on it (e.g.
> IMO publication_translate_columns can modified to return the BMS), and
> I will also add the necessary test cases.

Have at it ...

            regards, tom lane



On Mon, Jul 29, 2024 at 12:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Smith <smithpb2250@gmail.com> writes:
> > On Sat, Jul 27, 2024 at 11:15 PM PG Bug reporting form
> > <noreply@postgresql.org> wrote:
> >> The following script:
> >> CREATE TABLE t(a int);
> >> CREATE PUBLICATION p FOR TABLE t(a);
> >>
> >> ALTER PUBLICATION p SET TABLE t (a, ctid);
> >> triggers
> >> ERROR:  negative bitmapset member not allowed
>
> > My fix:
> > I feel the ALTER ... SET and CREATE PUBLICATION the same column list
> > validation logic. But instead of cut/pasting that validation checking
> > from publication_translate_columns(), attached is a diff patch that
> > exposes the publication_translate_columns() so we can just call that
> > same function.
>
> Agreed on that, but shouldn't this patch also be removing some code
> from the ALTER ... SET path?  Or is that part of the cleanup you
> handwaved about?

My hand waving was about the following code which is building the
'newcolumns' BMS:
foreach(lc, newpubrel->columns)
{
char    *colname = strVal(lfirst(lc));
AttrNumber attnum = get_attnum(newrelid, colname);
newcolumns = bms_add_member(newcolumns, attnum);
}

I was thinking that if publication_translate_columns() is modified to
return the BMS, which it is building internally anyway, then we avoid
processing the column list 2x. Then the above ALTER SET code can be
removed. Is that the same code ("shouldn't this patch also be removing
some code") you were referring to?

>
> > If this is deemed an acceptable fix, then I will improve on it (e.g.
> > IMO publication_translate_columns can modified to return the BMS), and
> > I will also add the necessary test cases.
>
> Have at it ...

Thanks!

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Peter Smith <smithpb2250@gmail.com> writes:
> On Mon, Jul 29, 2024 at 12:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Agreed on that, but shouldn't this patch also be removing some code
>> from the ALTER ... SET path?  Or is that part of the cleanup you
>> handwaved about?

> I was thinking that if publication_translate_columns() is modified to
> return the BMS, which it is building internally anyway, then we avoid
> processing the column list 2x. Then the above ALTER SET code can be
> removed. Is that the same code ("shouldn't this patch also be removing
> some code") you were referring to?

If publication_translate_columns is already building that exact same
BMS, then yeah we're on the same page.  I hadn't checked the code to
see.

If you'd have to add code to publication_translate_columns, then maybe
it'd be better to make AlterPublicationTables build the BMS from what
publication_translate_columns returns presently.

            regards, tom lane



On Mon, Jul 29, 2024 at 1:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Smith <smithpb2250@gmail.com> writes:
> > On Mon, Jul 29, 2024 at 12:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Agreed on that, but shouldn't this patch also be removing some code
> >> from the ALTER ... SET path?  Or is that part of the cleanup you
> >> handwaved about?
>
> > I was thinking that if publication_translate_columns() is modified to
> > return the BMS, which it is building internally anyway, then we avoid
> > processing the column list 2x. Then the above ALTER SET code can be
> > removed. Is that the same code ("shouldn't this patch also be removing
> > some code") you were referring to?
>
> If publication_translate_columns is already building that exact same
> BMS, then yeah we're on the same page.  I hadn't checked the code to
> see.
>
> If you'd have to add code to publication_translate_columns, then maybe
> it'd be better to make AlterPublicationTables build the BMS from what
> publication_translate_columns returns presently.
>

I chose the 2nd strategy and added a test case. In passing, made
better use of the existing 'newrelid' variable, otherwise that
variable would become unused.

Please see attached fix v2

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment
On Mon, 29 Jul 2024 at 17:19, Peter Smith <smithpb2250@gmail.com> wrote:
> Please see attached fix v2

It's likely also worth fixing the incorrect header comment for
publication_translate_columns() while fixing this.  "and a Bitmapset
with them;" seems to be incorrect and not all that well phrased.

On the whole, I think the API of these functions could be improved as
it's made the fix for this bug more convoluted than it needs to be.
It would be much nicer if publication_translate_columns() returned a
Bitmapset *.  That would reduce the code size by a dozen or so lines
as you could get rid of the qsort() and the compare_int16 function.
Converting a bitmapset to a vector will lead to a naturally sorted
vector. Doing this would mean having to invent a function that takes a
Bitmapset to do the job of buildint2vector, which is effectively, the
inverse of pub_collist_to_bitmapset(). You could probably also strip
about 7 lines out of pub_collist_to_bitmapset() by just doing
Bitmapset  *result = columns; It probably won't change the compiled
code, however.

I'm on the fence if this should be done as part of this bug fix.  The
reason I think it might is that you're changing
publication_translate_columns() to be non-static, and if the above API
change gets done later, that's then changing the API of an existing
external function. The reason against is that it's more risky to do in
the back branches as it's changing more code. What do others think?

Is it worth adding an ALTER PUBLICATION test with a duplicate column too?

David



On Mon, Jul 29, 2024 at 4:14 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 29 Jul 2024 at 17:19, Peter Smith <smithpb2250@gmail.com> wrote:
> > Please see attached fix v2
>
> It's likely also worth fixing the incorrect header comment for
> publication_translate_columns() while fixing this.  "and a Bitmapset
> with them;" seems to be incorrect and not all that well phrased.
>
Yeah, I noticed that appeared misleading.

> On the whole, I think the API of these functions could be improved as
> it's made the fix for this bug more convoluted than it needs to be.

I agree. It is better than it was, but is still jumping through some
hoops a bit to get the bitmap.

> It would be much nicer if publication_translate_columns() returned a
> Bitmapset *.  That would reduce the code size by a dozen or so lines
> as you could get rid of the qsort() and the compare_int16 function.
> Converting a bitmapset to a vector will lead to a naturally sorted
> vector. Doing this would mean having to invent a function that takes a
> Bitmapset to do the job of buildint2vector, which is effectively, the
> inverse of pub_collist_to_bitmapset(). You could probably also strip
> about 7 lines out of pub_collist_to_bitmapset() by just doing
> Bitmapset  *result = columns; It probably won't change the compiled
> code, however.
>
> I'm on the fence if this should be done as part of this bug fix.  The
> reason I think it might is that you're changing
> publication_translate_columns() to be non-static, and if the above API
> change gets done later, that's then changing the API of an existing
> external function. The reason against is that it's more risky to do in
> the back branches as it's changing more code. What do others think?
>

I'll wait until tomorrow in case there are other opinions as to
whether that belongs in this patch or elsewhere..

TBH, I was unsure if this error bugfix patch qualified to have
backpatches or not. Although it is a "bug" it was not impacting
anybody because it is only substituting one error msg for another
nicer error msg.

> Is it worth adding an ALTER PUBLICATION test with a duplicate column too?

+1 to do that.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



David Rowley <dgrowleyml@gmail.com> writes:
> On the whole, I think the API of these functions could be improved as
> it's made the fix for this bug more convoluted than it needs to be.

+1

> I'm on the fence if this should be done as part of this bug fix.  The
> reason I think it might is that you're changing
> publication_translate_columns() to be non-static, and if the above API
> change gets done later, that's then changing the API of an existing
> external function. The reason against is that it's more risky to do in
> the back branches as it's changing more code. What do others think?

If we are going to export a formerly-static function, we should make
its API as clean as we can.  I doubt that that adds meaningful risk,
and it avoids people possibly getting bit by cross-branch differences.

> Is it worth adding an ALTER PUBLICATION test with a duplicate column too?

Probably, just to prove that the non-duplicative representation does
what we want.

            regards, tom lane



Peter Smith <smithpb2250@gmail.com> writes:
> TBH, I was unsure if this error bugfix patch qualified to have
> backpatches or not. Although it is a "bug" it was not impacting
> anybody because it is only substituting one error msg for another
> nicer error msg.

True, we don't really have to back-patch.  I'm inclined to do so,
but might think differently once we see how big the final patch is.

            regards, tom lane



Sorry for my delay getting back to this thread.

Here is an updated patch v3 which has the following changes.

1. API.
The publication_translate_columns function now optionally returns the
Bitmapset* (that it was building anyway). The function comment was
also changed.

The patch is not quite the radical change suggested above [1]. I found
the currently returned 'attarray' is used in multiple places (in
publication_add_relation) so it seemed less invasive to leave those
other publication_translate_columns parameters as-is.

2. Tests.
Patch is now also testing for duplicates in a publication column list

======
[1] https://www.postgresql.org/message-id/CAApHDvo2i1j_iCFcURx5q7jYe70qk4Ca7J%2B8Dt9_jSMOdooAOA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment
On Fri, 2 Aug 2024 at 14:52, Peter Smith <smithpb2250@gmail.com> wrote:
> 1. API.
> The publication_translate_columns function now optionally returns the
> Bitmapset* (that it was building anyway). The function comment was
> also changed.
>
> The patch is not quite the radical change suggested above [1]. I found
> the currently returned 'attarray' is used in multiple places (in
> publication_add_relation) so it seemed less invasive to leave those
> other publication_translate_columns parameters as-is.

Looking at this, I only just noticed that fixing the bug is quite
simple. We could just do something like:

-                                               newcolumns =
bms_add_member(newcolumns, attnum);
+                                               if (attnum >= 0)
+                                                       newcolumns =
bms_add_member(newcolumns, attnum);

as later in AlterPublicationTables() the PublicationAddTables()
function is called and the column list is validated anyway.  So if
there are system attributes in the columns list, they'll still be
found and rejected.

For the other refactoring stuff. I know you said you didn't do it
because of the other usages of that array, but those could be fixed
with a bms_next_member() loop. I did that and ended up with:

$ git diff --stat -- src/backend/catalog/pg_publication.c
 src/backend/catalog/pg_publication.c | 105 ++++++++++++++---------------------
 1 file changed, 41 insertions(+), 64 deletions(-)

It saves about 2 dozen lines. However, I'm not as happy with the idea
as I thought I'd be.  I think the refactor feels a bit more pointless
if the bug was fixed with the attnum >= 0 fix.

I've attached patches for the record.

David

Attachment
On Sat, Aug 3, 2024 at 12:14 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 2 Aug 2024 at 14:52, Peter Smith <smithpb2250@gmail.com> wrote:
> > 1. API.
> > The publication_translate_columns function now optionally returns the
> > Bitmapset* (that it was building anyway). The function comment was
> > also changed.
> >
> > The patch is not quite the radical change suggested above [1]. I found
> > the currently returned 'attarray' is used in multiple places (in
> > publication_add_relation) so it seemed less invasive to leave those
> > other publication_translate_columns parameters as-is.
>
> Looking at this, I only just noticed that fixing the bug is quite
> simple. We could just do something like:
>
> -                                               newcolumns =
> bms_add_member(newcolumns, attnum);
> +                                               if (attnum >= 0)
> +                                                       newcolumns =
> bms_add_member(newcolumns, attnum);
>
> as later in AlterPublicationTables() the PublicationAddTables()
> function is called and the column list is validated anyway.  So if
> there are system attributes in the columns list, they'll still be
> found and rejected.
>

... which sounds good. But did you try it?

That AlterPublicationTables_fix.patch.txt patch only prevents the
"negative bitmapset member not allowed" error, but the validation for
system- or generated- columns (which is currently done only by
publication_translate_columns function) is not getting executed.
AFAICT since AlterPublicationTables does not now detect there is any
difference between 'oldcolumns' and 'newcolumns' the
PublicationDropTables decides not to drop the table, then
PublicationAddTables also returns early because the table already
exists.

The result is the bad columns that we specified using ALTER
PUBLICATION ... SET TABLE get silently overlooked.

For example,

test_pub=#
test_pub=# CREATE TABLE t1(a int, b int);
CREATE TABLE
test_pub=# CREATE TABLE t2(a int, b int);
CREATE TABLE
test_pub=# CREATE PUBLICATION pub1 FOR TABLE t1(a);
CREATE PUBLICATION
test_pub=# ALTER PUBLICATION pub1 ADD TABLE t2(a, ctid);
2024-08-05 16:12:34.002 AEST [6048] ERROR:  cannot use system column
"ctid" in publication column list
2024-08-05 16:12:34.002 AEST [6048] STATEMENT:  ALTER PUBLICATION pub1
ADD TABLE t2(a, ctid);
ERROR:  cannot use system column "ctid" in publication column list
test_pub=# ALTER PUBLICATION pub1 ADD TABLE t2(a, a);
2024-08-05 16:13:06.834 AEST [6048] ERROR:  duplicate column "a" in
publication column list
2024-08-05 16:13:06.834 AEST [6048] STATEMENT:  ALTER PUBLICATION pub1
ADD TABLE t2(a, a);
ERROR:  duplicate column "a" in publication column list

The following SET TABLE commands should fail with those same messages,
but your AlterPublicationTables_fix.patch.txt doesn't yield any errors
at all.

test_pub=# ALTER PUBLICATION pub1 SET TABLE t1(a, ctid);
ALTER PUBLICATION
test_pub=# ALTER PUBLICATION pub1 SET TABLE t1(a, a);
ALTER PUBLICATION
test_pub=#

======
Kind Regards,
Peter Smith.
Fujitsu Australia



On Mon, 5 Aug 2024 at 19:43, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sat, Aug 3, 2024 at 12:14 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > Looking at this, I only just noticed that fixing the bug is quite
> > simple. We could just do something like:
> >
> > -                                               newcolumns =
> > bms_add_member(newcolumns, attnum);
> > +                                               if (attnum >= 0)
> > +                                                       newcolumns =
> > bms_add_member(newcolumns, attnum);
> >
> > as later in AlterPublicationTables() the PublicationAddTables()
> > function is called and the column list is validated anyway.  So if
> > there are system attributes in the columns list, they'll still be
> > found and rejected.
> >
>
> ... which sounds good. But did you try it?

Seemingly not well enough. What I was trying to get around was the
double validation when modifying the column list or WHERE clause of an
existing table within a publication. I had assumed that because
PublicationAddTables() is still called that the validation is done
there.  What actually happens is PublicationAddTables() is called with
"if_not_exists" == true and we short-circuit out (without validating
the columns) when we see that the table already exists in the
publication.

Maybe it's not worth going to the trouble of rearranging the code so
the validation is only done once.  It is fairly cheap. get_attnum() is
a hash table lookup and the algorithm is just O(n) for the number of
attributes. It's probably not going to be a big deal on the
performance front.

Here's the patch updated to do the validation inside AlterPublicationTables().

David

Attachment
On Mon, Aug 12, 2024 at 5:45 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
...
>
> Here's the patch updated to do the validation inside AlterPublicationTables().
>

Hi David,

I think we've come full circle -- your fix is now pretty much the same
as my v3 patch [1].

The main difference is you have taken the opportunity to tidy some of
the surrounding functions at the same time as making the fix.

Your patch looks good to me.

~~~

publication_add_relation:
nit - variable /Bitmapset  *columns;/Bitmapset  *attnums;/
nit - comment /validate and translate.../Validate and translate.../

publication_validate_column_list:
nit - consider renaming this function to 'pub_collist_validate' (just
to be more like the existing 'pub_collist_to_bitmapset')

======
[1] https://www.postgresql.org/message-id/CAHut%2BPtp6Ztk-Dst_WDmQpAw%2BZy9EbP_8QSNk5NkDEjP%3D8xK-A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia