Re: row filtering for logical replication - Mailing list pgsql-hackers

From Greg Nancarrow
Subject Re: row filtering for logical replication
Date
Msg-id CAJcOf-dApxx3mqeHZ_20O-6hGOXmpMuqPpNb+3ZF3ENfb0oHoA@mail.gmail.com
Whole thread Raw
In response to Re: row filtering for logical replication  ("Euler Taveira" <euler@eulerto.com>)
Responses Re: row filtering for logical replication  ("Euler Taveira" <euler@eulerto.com>)
List pgsql-hackers
On Thu, Jul 1, 2021 at 10:43 AM Euler Taveira <euler@eulerto.com> wrote:
>
> Amit, thanks for rebasing this patch. I already had a similar rebased patch in
> my local tree. A recent patch broke your version v15 so I rebased it.
>
> I like the idea of a simple create_estate_for_relation() function (I fixed an
> oversight regarding GetCurrentCommandId(false) because it is used only for
> read-only purposes). This patch also replaces all references to version 14.
>
> Commit ef948050 made some changes in the snapshot handling. Set the current
> active snapshot might not be required but future changes to allow functions
> will need it.
>
> As the previous patches, it includes commits (0002 and 0003) that are not
> intended to be committed. They are available for test-only purposes.
>

I have some review comments on the "Row filter for logical replication" patch:

(1) Suggested update to patch comment:
(There are some missing words and things which could be better expressed)


This feature adds row filtering for publication tables.
When a publication is defined or modified, rows that don't satisfy a WHERE
clause may be optionally filtered out. This allows a database or set of
tables to be partially replicated. The row filter is per table, which allows
different row filters to be defined for different tables. A new row filter
can be added simply by specifying a WHERE clause after the table name.
The WHERE clause must be enclosed by parentheses.

The WHERE clause should probably contain only columns that are part of the
primary key or that are covered by REPLICA IDENTITY. Otherwise, any DELETEs
won't be replicated. DELETE uses the old row version (that is limited to
primary key or REPLICA IDENTITY) to evaluate the row filter. INSERT and UPDATE
use the new row version to evaluate the row filter, hence, you can use any
column. If the row filter evaluates to NULL, it returns false. For simplicity,
functions are not allowed; that could possibly be addressed in a future patch.

If you choose to do the initial table synchronization, only data that satisfies
the row filters is sent. If the subscription has several publications in which
a table has been published with different WHERE clauses, rows must satisfy all
expressions to be copied. If subscriber is a pre-15 version, data
synchronization won't use row filters if they are defined in the publisher.
Previous versions cannot handle row filters.

If your publication contains a partitioned table, the publication parameter
publish_via_partition_root determines if it uses the partition row filter (if
the parameter is false, the default) or the root partitioned table row filter.


(2) Some inconsistent error message wording:

Currently:
err = _("cannot use subquery in publication WHERE expression");

Suggest changing it to:
err = _("subqueries are not allowed in publication WHERE expressions");


Other examples from the patch:
err = _("aggregate functions are not allowed in publication WHERE expressions");
err = _("grouping operations are not allowed in publication WHERE expressions");
err = _("window functions are not allowed in publication WHERE expressions");
errmsg("functions are not allowed in publication WHERE expressions"),
err = _("set-returning functions are not allowed in publication WHERE
expressions");


(3) The current code still allows arbitrary code execution, e.g. via a
user-defined operator:

e.g.
publisher:

CREATE OR REPLACE FUNCTION myop(left_arg INTEGER, right_arg INTEGER)
RETURNS BOOL AS
$$
BEGIN
  RAISE NOTICE 'I can do anything here!';
  RETURN left_arg > right_arg;
 END;
$$ LANGUAGE PLPGSQL VOLATILE;

CREATE OPERATOR >>>> (
  PROCEDURE = myop,
  LEFTARG = INTEGER,
  RIGHTARG = INTEGER
);

CREATE PUBLICATION tap_pub FOR TABLE test_tab WHERE (a >>>> 5);

subscriber:
CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub
application_name=tap_sub' PUBLICATION tap_pub;


Perhaps add the following after the existing shell error-check in make_op():

/* User-defined operators are not allowed in publication WHERE clauses */
if (pstate->p_expr_kind == EXPR_KIND_PUBLICATION_WHERE && opform->oid
>= FirstNormalObjectId)
    ereport(ERROR,
    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
    errmsg("user-defined operators are not allowed in publication
WHERE expressions"),
    parser_errposition(pstate, location)));


Also, I believe it's also allowing user-defined CASTs (so could add a
similar check to above in transformTypeCast()).
Ideally, it would be preferable to validate/check publication WHERE
expressions in one central place, rather than scattered all over the
place, but that might be easier said than done.
You need to update the patch comment accordingly.


(4) src/backend/replication/pgoutput/pgoutput.c
pgoutput_change()

The 3 added calls to pgoutput_row_filter() are returning from
pgoutput_change(), if false is returned, but instead they should break
from the switch, otherwise cleanup code is missed. This is surely a
bug.

e.g.
(3 similar cases of this)

+ if (!pgoutput_row_filter(relation, NULL, tuple, relentry->qual))
+ return;

should be:

+ if (!pgoutput_row_filter(relation, NULL, tuple, relentry->qual))
+ break;


Regards,
Greg Nancarrow
Fujitsu Australia



pgsql-hackers by date:

Previous
From: "osumi.takamichi@fujitsu.com"
Date:
Subject: RE: Disable WAL logging to speed up data loading
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: "debug_invalidate_system_caches_always" is too long