Thread: logical replication worker accesses catalogs in error context callback

logical replication worker accesses catalogs in error context callback

From: Andres Freund
Date: 06 January 2021, 05:02:29

Hi,

Due to a debug ereport I just noticed that worker.c's
slot_store_error_callback is doing something quite dangerous:

static void
slot_store_error_callback(void *arg)
{
    SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
    LogicalRepRelMapEntry *rel;
    char       *remotetypname;
    Oid            remotetypoid,
                localtypoid;

    /* Nothing to do if remote attribute number is not set */
    if (errarg->remote_attnum < 0)
        return;

    rel = errarg->rel;
    remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];

    /* Fetch remote type name from the LogicalRepTypMap cache */
    remotetypname = logicalrep_typmap_gettypname(remotetypoid);

    /* Fetch local type OID from the local sys cache */
    localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);

    errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
               "remote type %s, local type %s",
               rel->remoterel.nspname, rel->remoterel.relname,
               rel->remoterel.attnames[errarg->remote_attnum],
               remotetypname,
               format_type_be(localtypoid));
}


that's not code that can run in an error context callback. It's
theoretically possible (but unlikely) that
logicalrep_typmap_gettypname() is safe to run in an error context
callback. But get_atttype() definitely isn't.

get_atttype() may do catalog accesses. That definitely can't happen
inside an error context callback - the entire transaction might be
borked at this point!


I think this was basically broken from the start in

commit 665d1fad99e7b11678b0d5fa24d2898424243cd6
Author: Peter Eisentraut <peter_e@gmx.net>
Date:   2017-01-19 12:00:00 -0500

    Logical replication

but then got a lot worse in

commit 24c0a6c649768f428929e76dd7f5012ec9b93ce1
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   2018-03-14 21:34:26 -0300

    logical replication: fix OID type mapping mechanism

Greetings,

Andres Freund

Re: logical replication worker accesses catalogs in error context callback

From: Masahiko Sawada
Date: 06 January 2021, 05:42:39

On Wed, Jan 6, 2021 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Due to a debug ereport I just noticed that worker.c's
> slot_store_error_callback is doing something quite dangerous:
>
> static void
> slot_store_error_callback(void *arg)
> {
>         SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
>         LogicalRepRelMapEntry *rel;
>         char       *remotetypname;
>         Oid                     remotetypoid,
>                                 localtypoid;
>
>         /* Nothing to do if remote attribute number is not set */
>         if (errarg->remote_attnum < 0)
>                 return;
>
>         rel = errarg->rel;
>         remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];
>
>         /* Fetch remote type name from the LogicalRepTypMap cache */
>         remotetypname = logicalrep_typmap_gettypname(remotetypoid);
>
>         /* Fetch local type OID from the local sys cache */
>         localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);
>
>         errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
>                            "remote type %s, local type %s",
>                            rel->remoterel.nspname, rel->remoterel.relname,
>                            rel->remoterel.attnames[errarg->remote_attnum],
>                            remotetypname,
>                            format_type_be(localtypoid));
> }
>
>
> that's not code that can run in an error context callback. It's
> theoretically possible (but unlikely) that
> logicalrep_typmap_gettypname() is safe to run in an error context
> callback. But get_atttype() definitely isn't.
>
> get_attype() may do catalog accesses. That definitely can't happen
> inside an error context callback - the entire transaction might be
> borked at this point!

You're right. Calling format_type_be() is perhaps also dangerous,
since it does catalog access. We should have added the local type
names to SlotErrCallbackArg so that we avoid catalog access in the
error context callback.

I'll try to fix this.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Re: logical replication worker accesses catalogs in error context callback

From: Masahiko Sawada
Date: 07 January 2021, 03:16:39

On Wed, Jan 6, 2021 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jan 6, 2021 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > Due to a debug ereport I just noticed that worker.c's
> > slot_store_error_callback is doing something quite dangerous:
> >
> > static void
> > slot_store_error_callback(void *arg)
> > {
> >         SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
> >         LogicalRepRelMapEntry *rel;
> >         char       *remotetypname;
> >         Oid                     remotetypoid,
> >                                 localtypoid;
> >
> >         /* Nothing to do if remote attribute number is not set */
> >         if (errarg->remote_attnum < 0)
> >                 return;
> >
> >         rel = errarg->rel;
> >         remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];
> >
> >         /* Fetch remote type name from the LogicalRepTypMap cache */
> >         remotetypname = logicalrep_typmap_gettypname(remotetypoid);
> >
> >         /* Fetch local type OID from the local sys cache */
> >         localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);
> >
> >         errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
> >                            "remote type %s, local type %s",
> >                            rel->remoterel.nspname, rel->remoterel.relname,
> >                            rel->remoterel.attnames[errarg->remote_attnum],
> >                            remotetypname,
> >                            format_type_be(localtypoid));
> > }
> >
> >
> > that's not code that can run in an error context callback. It's
> > theoretically possible (but unlikely) that
> > logicalrep_typmap_gettypname() is safe to run in an error context
> > callback. But get_atttype() definitely isn't.
> >
> > get_attype() may do catalog accesses. That definitely can't happen
> > inside an error context callback - the entire transaction might be
> > borked at this point!
>
> You're right. Perhaps calling to format_type_be() is also dangerous
> since it does catalog access. We should have added the local type
> names to SlotErrCallbackArg so we avoid catalog access in the error
> context.
>
> I'll try to fix this.

Attached is a patch that fixes this issue.

Since logicalrep_typmap_gettypname() could search the sys cache by
calling format_type_be(), I stored both the local and remote type names
in SlotErrCallbackArg so that the error callback can just use the names
without a sys cache lookup.
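
The approach described above can be sketched outside the server as follows. This is a minimal standalone model, not the attached patch: DemoErrCallbackArg and demo_error_callback are hypothetical names, and the callback writes into a caller-supplied buffer instead of calling errcontext(). The point it illustrates is that every name is resolved before the risky operation starts, so the callback only formats strings it already holds.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for SlotErrCallbackArg. Every string is filled in
 * by the caller while catalog access is still safe.
 */
typedef struct DemoErrCallbackArg
{
    const char *nspname;        /* remote schema name */
    const char *relname;        /* remote relation name */
    const char *attname;        /* remote column name */
    const char *remotetypname;  /* remote type name, resolved up front */
    const char *localtypname;   /* local type name, resolved up front */
    int         remote_attnum;  /* negative: no column is being processed */
} DemoErrCallbackArg;

/*
 * Model of the error context callback: it performs no lookups at all and
 * only formats the strings stored in the argument. Returns the number of
 * characters snprintf() would write, or 0 when there is nothing to report.
 */
static int
demo_error_callback(const DemoErrCallbackArg *arg, char *buf, size_t buflen)
{
    assert(arg != NULL && buf != NULL && buflen > 0);
    buf[0] = '\0';

    /* Nothing to do if the remote attribute number is not set */
    if (arg->remote_attnum < 0)
        return 0;

    return snprintf(buf, buflen,
                    "processing remote data for replication target relation "
                    "\"%s.%s\" column \"%s\", remote type %s, local type %s",
                    arg->nspname, arg->relname, arg->attname,
                    arg->remotetypname, arg->localtypname);
}
```

In this model the caller would fill in remotetypname and localtypname once per column, before converting the value, and reset remote_attnum to a negative number afterwards.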

Please review it.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment: fix_slot_store_error_callback.patch

Re: logical replication worker accesses catalogs in error context callback

From: Bharath Rupireddy
Date: 09 January 2021, 08:57:46

On Thu, Jan 7, 2021 at 5:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jan 6, 2021 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jan 6, 2021 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > Due to a debug ereport I just noticed that worker.c's
> > > slot_store_error_callback is doing something quite dangerous:
> > >
> > > static void
> > > slot_store_error_callback(void *arg)
> > > {
> > >         SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
> > >         LogicalRepRelMapEntry *rel;
> > >         char       *remotetypname;
> > >         Oid                     remotetypoid,
> > >                                 localtypoid;
> > >
> > >         /* Nothing to do if remote attribute number is not set */
> > >         if (errarg->remote_attnum < 0)
> > >                 return;
> > >
> > >         rel = errarg->rel;
> > >         remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];
> > >
> > >         /* Fetch remote type name from the LogicalRepTypMap cache */
> > >         remotetypname = logicalrep_typmap_gettypname(remotetypoid);
> > >
> > >         /* Fetch local type OID from the local sys cache */
> > >         localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);
> > >
> > >         errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
> > >                            "remote type %s, local type %s",
> > >                            rel->remoterel.nspname, rel->remoterel.relname,
> > >                            rel->remoterel.attnames[errarg->remote_attnum],
> > >                            remotetypname,
> > >                            format_type_be(localtypoid));
> > > }
> > >
> > >
> > > that's not code that can run in an error context callback. It's
> > > theoretically possible (but unlikely) that
> > > logicalrep_typmap_gettypname() is safe to run in an error context
> > > callback. But get_atttype() definitely isn't.
> > >
> > > get_attype() may do catalog accesses. That definitely can't happen
> > > inside an error context callback - the entire transaction might be
> > > borked at this point!
> >
> > You're right. Perhaps calling to format_type_be() is also dangerous
> > since it does catalog access. We should have added the local type
> > names to SlotErrCallbackArg so we avoid catalog access in the error
> > context.
> >
> > I'll try to fix this.
>
> Attached the patch that fixes this issue.
>
> Since logicalrep_typmap_gettypname() could search the sys cache by
> calling to format_type_be(), I stored both local and remote type names
> to SlotErrCallbackArg so that we can just set the names in the error
> callback without sys cache lookup.
>
> Please review it.

The patch looks good, except for one minor comment: can we store
rel->remoterel.atttyps[remoteattnum] in a local variable and use it
in logicalrep_typmap_gettypname, just to save on indentation?

I quickly searched the places where error callbacks are used. I
think we need a similar fix for conversion_error_callback in
postgres_fdw.c, because it uses get_attname and get_rel_name,
which do catalogue lookups. ISTM that all the other error callbacks
are fine in the sense that they do not do sys catalogue lookups.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: logical replication worker accesses catalogs in error context callback

From: Masahiko Sawada
Date: 11 January 2021, 08:25:29

On Sat, Jan 9, 2021 at 2:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Jan 7, 2021 at 5:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jan 6, 2021 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Jan 6, 2021 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Due to a debug ereport I just noticed that worker.c's
> > > > slot_store_error_callback is doing something quite dangerous:
> > > >
> > > > static void
> > > > slot_store_error_callback(void *arg)
> > > > {
> > > >         SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
> > > >         LogicalRepRelMapEntry *rel;
> > > >         char       *remotetypname;
> > > >         Oid                     remotetypoid,
> > > >                                 localtypoid;
> > > >
> > > >         /* Nothing to do if remote attribute number is not set */
> > > >         if (errarg->remote_attnum < 0)
> > > >                 return;
> > > >
> > > >         rel = errarg->rel;
> > > >         remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];
> > > >
> > > >         /* Fetch remote type name from the LogicalRepTypMap cache */
> > > >         remotetypname = logicalrep_typmap_gettypname(remotetypoid);
> > > >
> > > >         /* Fetch local type OID from the local sys cache */
> > > >         localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);
> > > >
> > > >         errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
> > > >                            "remote type %s, local type %s",
> > > >                            rel->remoterel.nspname, rel->remoterel.relname,
> > > >                            rel->remoterel.attnames[errarg->remote_attnum],
> > > >                            remotetypname,
> > > >                            format_type_be(localtypoid));
> > > > }
> > > >
> > > >
> > > > that's not code that can run in an error context callback. It's
> > > > theoretically possible (but unlikely) that
> > > > logicalrep_typmap_gettypname() is safe to run in an error context
> > > > callback. But get_atttype() definitely isn't.
> > > >
> > > > get_attype() may do catalog accesses. That definitely can't happen
> > > > inside an error context callback - the entire transaction might be
> > > > borked at this point!
> > >
> > > You're right. Perhaps calling to format_type_be() is also dangerous
> > > since it does catalog access. We should have added the local type
> > > names to SlotErrCallbackArg so we avoid catalog access in the error
> > > context.
> > >
> > > I'll try to fix this.
> >
> > Attached the patch that fixes this issue.
> >
> > Since logicalrep_typmap_gettypname() could search the sys cache by
> > calling to format_type_be(), I stored both local and remote type names
> > to SlotErrCallbackArg so that we can just set the names in the error
> > callback without sys cache lookup.
> >
> > Please review it.
>
> Patch looks good, except one minor comment - can we store
> rel->remoterel.atttyps[remoteattnum] into a local variable and use it
> in logicalrep_typmap_gettypname, just to save on indentation?

Thank you for reviewing the patch!

Agreed. Attached is the updated patch.

>
> I quickly searched in places where error callbacks are being used, I
> think we need a similar kind of fix for conversion_error_callback in
> postgres_fdw.c, because get_attname and get_rel_name are being used
> which do catalogue lookups. ISTM that all the other error callbacks
> are good in the sense that they are not doing sys catalogue lookups.

Indeed. If we need to disallow catalog lookups while executing
error callbacks, we might want to add an assertion checking that in
SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment: fix_slot_store_error_callback_v2.patch

Re: logical replication worker accesses catalogs in error context callback

From: Bharath Rupireddy
Date: 11 January 2021, 10:46:09

On Mon, Jan 11, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Agreed. Attached the updated patch.

Thanks for the updated patch. It looks like the comment crosses the 80
char limit; I have corrected it. I also changed the variable name
from remotetypeid to remotetypid, so that the logicalrep_typmap_gettypname
call will not cross the 80 char limit, and added a commit message.
Attaching the v3 patch, please have a look. Both make check and make
check-world pass.

> > I quickly searched in places where error callbacks are being used, I
> > think we need a similar kind of fix for conversion_error_callback in
> > postgres_fdw.c, because get_attname and get_rel_name are being used
> > which do catalogue lookups. ISTM that all the other error callbacks
> > are good in the sense that they are not doing sys catalogue lookups.
>
> Indeed. If we need to disallow the catalog lookup during executing
> error callbacks we might want to have an assertion checking that in
> SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).

I tried adding Assert(!error_context_stack); to SearchCatCacheInternal
(as in the attached
v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patch), but initdb
itself fails [1]. That means we are doing a bunch of catalogue accesses
from error context callbacks. Given this, I'm not quite sure whether we
can have such an assertion in SearchCatCacheInternal.

Although unrelated to what we are discussing here: when I looked at
SearchCatCacheInternal, I found that the function comment of
SearchCatCache names SearchCatCacheInternal instead. I think we should
correct it. Thoughts? If okay, I will post a separate patch for this
minor comment fix.

[1]
running bootstrap script ... TRAP: FailedAssertion("error_context_stack", File: "catcache.c", Line: 1220, PID: 310728)
/home/bharath/workspace/postgres/instnew/bin/postgres(ExceptionalCondition+0xd0)[0x56056984c8ba]
/home/bharath/workspace/postgres/instnew/bin/postgres(+0x76eb1a)[0x560569826b1a]
/home/bharath/workspace/postgres/instnew/bin/postgres(SearchCatCache1+0x3a)[0x5605698269af]
/home/bharath/workspace/postgres/instnew/bin/postgres(SearchSysCache1+0xc1)[0x5605698448b2]
/home/bharath/workspace/postgres/instnew/bin/postgres(get_typtype+0x1f)[0x56056982e389]
/home/bharath/workspace/postgres/instnew/bin/postgres(CheckAttributeType+0x29)[0x5605692bafe9]
/home/bharath/workspace/postgres/instnew/bin/postgres(CheckAttributeNamesTypes+0x2c9)[0x5605692bafac]
/home/bharath/workspace/postgres/instnew/bin/postgres(heap_create_with_catalog+0x11f)[0x5605692bc436]
/home/bharath/workspace/postgres/instnew/bin/postgres(boot_yyparse+0x7f0)[0x5605692a0d3f]
/home/bharath/workspace/postgres/instnew/bin/postgres(+0x1ecb36)[0x5605692a4b36]
/home/bharath/workspace/postgres/instnew/bin/postgres(AuxiliaryProcessMain+0x5e0)[0x5605692a4997]
/home/bharath/workspace/postgres/instnew/bin/postgres(main+0x268)[0x5605694c6bce]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fa2777ce0b3]
/home/bharath/workspace/postgres/instnew/bin/postgres(_start+0x2e)[0x5605691794de]

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: logical replication worker accesses catalogs in error context callback

From: Masahiko Sawada
Date: 12 January 2021, 07:09:49

On Mon, Jan 11, 2021 at 4:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Jan 11, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Agreed. Attached the updated patch.
>
> Thanks for the updated patch. Looks like the comment crosses the 80
> char limit, I have corrected it. And also changed the variable name
> from remotetypeid to remotetypid, so that logicalrep_typmap_gettypname
> will not cross the 80 char limit. And also added a commit message.
> Attaching v3 patch, please have a look. Both make check and make
> check-world passes.

Thanks! The change looks good to me.

>
> > > I quickly searched in places where error callbacks are being used, I
> > > think we need a similar kind of fix for conversion_error_callback in
> > > postgres_fdw.c, because get_attname and get_rel_name are being used
> > > which do catalogue lookups. ISTM that all the other error callbacks
> > > are good in the sense that they are not doing sys catalogue lookups.
> >
> > Indeed. If we need to disallow the catalog lookup during executing
> > error callbacks we might want to have an assertion checking that in
> > SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).
>
> I tried to add(as attached in
> v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patch) the
> Assert(!error_context_stack); in SearchCatCacheInternal, initdb itself
> fails [1]. That means, we are doing a bunch of catalogue access from
> error context callbacks. Given this, I'm not quite sure whether we can
> have such an assertion in SearchCatCacheInternal.

I think checking !error_context_stack is not a correct way to tell
whether we're executing an error context callback, since it's a stack
that stores callbacks. It can be non-NULL simply because an error
callback has been set up; see setup_parser_errposition_callback() for
instance. Probably we need to check (recursion_depth > 0) and elevel
instead. Attached a patch for that as an example.
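
The distinction drawn above can be modeled in a few lines of standalone C. This is a simplified sketch, not the code in elog.c and not the attached patch: DemoCallback, demo_context_stack and demo_callback_depth are hypothetical stand-ins for ErrorContextCallback, error_context_stack and a depth counter such as recursion_depth, and the elevel part of the check is left out. It shows that a non-NULL stack only means a callback is registered, while a separate counter is needed to know that callbacks are actually running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an error context stack entry */
typedef struct DemoCallback
{
    struct DemoCallback *previous;      /* next entry down the stack */
    void        (*callback) (void *arg);
    void       *arg;
} DemoCallback;

static DemoCallback *demo_context_stack = NULL; /* registered callbacks */
static int  demo_callback_depth = 0;    /* > 0 only while callbacks run */

/* True only while demo_report_error() is invoking callbacks */
static bool
demo_in_error_callback(void)
{
    return demo_callback_depth > 0;
}

/* Register a callback; this alone makes the stack non-NULL */
static void
demo_push(DemoCallback *cb)
{
    cb->previous = demo_context_stack;
    demo_context_stack = cb;
}

/* Unregister the most recently registered callback */
static void
demo_pop(void)
{
    assert(demo_context_stack != NULL);
    demo_context_stack = demo_context_stack->previous;
}

/* Simulated error report: walk the stack and invoke each callback */
static void
demo_report_error(void)
{
    demo_callback_depth++;
    for (DemoCallback *cb = demo_context_stack; cb != NULL; cb = cb->previous)
        cb->callback(cb->arg);
    demo_callback_depth--;
}

/* Example callback: records whether the "in callback" test held while it ran */
static void
demo_record_flag(void *arg)
{
    *(bool *) arg = demo_in_error_callback();
}
```

With one callback pushed, demo_context_stack is non-NULL during perfectly normal processing, which is why an assertion on the stack pointer alone fires even though no error is being reported.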

>
> Although unrelated to what we are discussing here -  when I looked at
> SearchCatCacheInternal, I found that the function SearchCatCache has
> SearchCatCacheInternal in the function comment, I think we should
> correct it. Thoughts? If okay, I will post a separate patch for this
> minor comment fix.

Perhaps you mean this?

/*
 *  SearchCatCacheInternal
 *
 *      This call searches a system cache for a tuple, opening the relation
 *      if necessary (on the first access to a particular cache).
 *
 *      The result is NULL if not found, or a pointer to a HeapTuple in
 *      the cache.  The caller must not modify the tuple, and must call
 *      ReleaseCatCache() when done with it.
 *
 * The search key values should be expressed as Datums of the key columns'
 * datatype(s).  (Pass zeroes for any unused parameters.)  As a special
 * exception, the passed-in key for a NAME column can be just a C string;
 * the caller need not go to the trouble of converting it to a fully
 * null-padded NAME.
 */
HeapTuple
SearchCatCache(CatCache *cache,

Looking at commit 141fd1b66, the comment was changed there from
SearchCatCache to SearchCatCacheInternal, apparently on purpose, but it
seems to me that it's a typo.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment: avoid_sys_cache_lookup_in_error_callback_v2.patch

Re: logical replication worker accesses catalogs in error context callback

From: Bharath Rupireddy
Date: 12 January 2021, 12:33:44

On Tue, Jan 12, 2021 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 11, 2021 at 4:46 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Mon, Jan 11, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Agreed. Attached the updated patch.
> >
> > Thanks for the updated patch. Looks like the comment crosses the 80
> > char limit, I have corrected it. And also changed the variable name
> > from remotetypeid to remotetypid, so that logicalrep_typmap_gettypname
> > will not cross the 80 char limit. And also added a commit message.
> > Attaching v3 patch, please have a look. Both make check and make
> > check-world passes.
>
> Thanks! The change looks good to me.

Thanks!

> > > > I quickly searched in places where error callbacks are being used, I
> > > > think we need a similar kind of fix for conversion_error_callback in
> > > > postgres_fdw.c, because get_attname and get_rel_name are being used
> > > > which do catalogue lookups. ISTM that all the other error callbacks
> > > > are good in the sense that they are not doing sys catalogue lookups.
> > >
> > > Indeed. If we need to disallow the catalog lookup during executing
> > > error callbacks we might want to have an assertion checking that in
> > > SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).
> >
> > I tried to add(as attached in
> > v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patch) the
> > Assert(!error_context_stack); in SearchCatCacheInternal, initdb itself
> > fails [1]. That means, we are doing a bunch of catalogue access from
> > error context callbacks. Given this, I'm not quite sure whether we can
> > have such an assertion in SearchCatCacheInternal.
>
> I think checking !error_context_stack is not a correct check if we're
> executing an error context callback since it's a stack to store
> callbacks. It can be non-NULL by setting an error callback, see
> setup_parser_errposition_callback() for instance.

Thanks! Yes, you are right: error_context_stack can be non-null even
when we are not processing an actual error callback, hence
the Assert(!error_context_stack); doesn't make any sense.

> Probably we need to check if (recursion_depth > 0) and elevel.
> Attached a patch for that as an example.

IIUC, in SearchCatCacheInternal we need to know whether errstart
was called with elevel >= ERROR and whether recursion_depth > 0. If
both are true, then the assertion in SearchCatCacheInternal should
fail. I see that in your patch elevel is fetched from
errordata; that's fine. What I'm not quite clear about is the following
assumption:

+    /* If we doesn't set any error data yet, assume it's an error */
+    if (errordata_stack_depth == -1)
+        return true;

Is it always true that we are in error processing when
errordata_stack_depth is -1? And what happens if errordata_stack_depth <
-1? Maybe I'm missing something.

IMHO, adding an assertion to SearchCatCacheInternal (which is one of the
most generic code paths within the server) that relies on a few error
context global variables may not always be safe, because we might not
use the error context variables properly. Instead, we could add a
comment to the ErrorContextCallback structure saying something like,
"it's not recommended to access any system catalogues within an error
context callback when the callback is expected to be called while
processing an error, because the transaction might have been broken in
that case", and let future callback developers take care of it.

Thoughts?

As I said earlier [1], currently only two (there could be more) error
context callbacks access the sys catalogues. One is
slot_store_error_callback, which will be fixed by your patch. The other
is conversion_error_callback, which we can also fix by storing the
relname, attname and other required info in ConversionLocation,
something like the attached. Please have a look.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment: v1-0001-Avoid-Catalogue-Accesses-In-conversion_error_call.patch

Re: logical replication worker accesses catalogs in error context callback

From: Masahiko Sawada
Date: 26 January 2021, 08:58:55

On Tue, Jan 12, 2021 at 6:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Jan 12, 2021 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jan 11, 2021 at 4:46 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > On Mon, Jan 11, 2021 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > Agreed. Attached the updated patch.
> > >
> > > Thanks for the updated patch. Looks like the comment crosses the 80
> > > char limit, I have corrected it. And also changed the variable name
> > > from remotetypeid to remotetypid, so that logicalrep_typmap_gettypname
> > > will not cross the 80 char limit. And also added a commit message.
> > > Attaching v3 patch, please have a look. Both make check and make
> > > check-world passes.
> >
> > Thanks! The change looks good to me.
>
> Thanks!
>
> > > > > I quickly searched in places where error callbacks are being used, I
> > > > > think we need a similar kind of fix for conversion_error_callback in
> > > > > postgres_fdw.c, because get_attname and get_rel_name are being used
> > > > > which do catalogue lookups. ISTM that all the other error callbacks
> > > > > are good in the sense that they are not doing sys catalogue lookups.
> > > >
> > > > Indeed. If we need to disallow the catalog lookup during executing
> > > > error callbacks we might want to have an assertion checking that in
> > > > SearchCatCacheInternal(), in addition to Assert(IsTransactionState()).
> > >
> > > I tried to add(as attached in
> > > v1-0001-Avoid-Sys-Cache-Lookups-in-Error-Callbacks.patch) the
> > > Assert(!error_context_stack); in SearchCatCacheInternal, initdb itself
> > > fails [1]. That means, we are doing a bunch of catalogue access from
> > > error context callbacks. Given this, I'm not quite sure whether we can
> > > have such an assertion in SearchCatCacheInternal.
> >
> > I think checking !error_context_stack is not a correct check if we're
> > executing an error context callback since it's a stack to store
> > callbacks. It can be non-NULL by setting an error callback, see
> > setup_parser_errposition_callback() for instance.
>
> Thanks! Yes, you are right, even though we are not processing the
> actual error callback, the error_context_stack can be non-null, hence
> the Assert(!error_context_stack); doesn't make any sense.
>
> > Probably we need to check if (recursion_depth > 0) and elevel.
> > Attached a patch for that as an example.
>
> IIUC, we must be knowing in SearchCatCacheInternal, whether errstart
> is called with elevel >= ERROR and we have recursion_depth > 0. If
> both are true, then the assertion in SearchCatCacheInternal should
> fail. I see that in your patch, elevel is being fetched from
> errordata, that's fine. What I'm not quite clear is the following
> assumption:
>
> +    /* If we doesn't set any error data yet, assume it's an error */
> +    if (errordata_stack_depth == -1)
> +        return true;
>
> Is it always true that we are in error processing when
> errordata_stack_depth is -1, what happens if errordata_stack_depth <
> -1? Maybe I'm missing something.

You're right. I missed something.

>
> IMHO, adding an assertion in SearchCatCacheInternal(which is a most
> generic code part within the server) with a few error context global
> variables may not be always safe. Because we might miss using the
> error context variables properly. Instead, we could add a comment in
> ErrorContextCallback structure saying something like, "it's not
> recommended to access any system catalogues within an error context
> callback when the callback is expected to be called while processing
> an error, because the transaction might have been broken in that
> case." And let the future callback developers take care of it.
>
> Thoughts?

That sounds good to me. But I still see value in adding an
assertion to SearchCatCacheInternal(). If we had had it, we could have
found these two bugs earlier.

Anyway, this seems to be unrelated to this bug fix, so we can start
a new thread for that.

>
> As I said earlier [1], currently only two(there could be more) error
> context callbacks access the sys catalogues, one is in
> slot_store_error_callback which will be fixed with your patch. Another
> is in conversion_error_callback, we can also fix this by storing the
> relname, attname and other required info in ConversionLocation,
> something like the attached. Please have a look.
>
> Thoughts?

Thank you for the patch!

Here are some comments:

+static void set_error_callback_info(ConversionLocation *errpos,
+                                          Relation rel, int cur_attno,
+                                          ForeignScanState *fsstate);

I'm a bit concerned that the function name is too generic. How about
set_conversion_error_callback_info() or something similar to make the
name clear?

---
+static void
+conversion_error_callback(void *arg)
+{
+       ConversionLocation *errpos = (ConversionLocation *) arg;
+
+       if (errpos->show_generic_message)
+               errcontext("processing expression at position %d in select list",
+                                       errpos->cur_attno);
+

I think we can emit this generic message as the error context when
errpos->relname is NULL, instead of using a separate
show_generic_message flag.
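
A standalone sketch of that suggestion, under stated assumptions: DemoConversionLocation and demo_conversion_context are hypothetical names, the struct holds only the fields needed here, the column message wording is illustrative, and the text goes to a caller-supplied buffer rather than through errcontext(). A NULL relname doubles as the signal for the generic message, so no extra boolean is stored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced stand-in for ConversionLocation */
typedef struct DemoConversionLocation
{
    const char *relname;    /* NULL when no relation name was captured */
    const char *attname;    /* column name; used only when relname is set */
    int         cur_attno;  /* position in the select list */
} DemoConversionLocation;

/*
 * Model of the callback body: choose the message from the presence of
 * relname alone. Returns the number of characters snprintf() would write.
 */
static int
demo_conversion_context(const DemoConversionLocation *errpos,
                        char *buf, size_t buflen)
{
    assert(errpos != NULL && buf != NULL && buflen > 0);

    /* No relation name captured: fall back to the generic message */
    if (errpos->relname == NULL)
        return snprintf(buf, buflen,
                        "processing expression at position %d in select list",
                        errpos->cur_attno);

    /* Names were captured before the conversion, so no lookup is needed */
    return snprintf(buf, buflen, "column \"%s\" of foreign table \"%s\"",
                    errpos->attname, errpos->relname);
}
```

Dropping the flag leaves one less field to keep in sync: whoever fills the struct before the conversion either sets relname and attname together or leaves both NULL.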

---
+       /*
+        * Set error context callback info, so that we could avoid accessing
+        * the system catalogues while processing the error in
+        * conversion_error_callback. System catalogue accesses are not safe in
+        * error context callbacks because the transaction might have been
+        * broken by then.
+        */
+       set_error_callback_info(&errpos, rel, i, fsstate);

Looking at other code, we use "catalog" rather than "catalogue" in
most places. Wouldn't it be better to use "catalog" for consistency? FYI,
"catalogue" appears in only three places in the server code:

$ git grep "catalogue" -- '*.[ch]'
src/backend/access/brin/brin.c: * This routine scans the complete index looking for uncatalogued index pages,
src/backend/catalog/pg_constraint.c: * given relation; or InvalidOid if no such index is catalogued.
src/backend/executor/execMain.c:     * As in case of the catalogued constraints, we treat a NULL result as

FYI I've added those bug fixes to the next Commitfest[1].

Regards,

[1] https://commitfest.postgresql.org/32/2955/

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: logical replication worker accesses catalogs in error context callback

From

Bharath Rupireddy

Date:

27 January 2021, 04:17:15

On Tue, Jan 26, 2021 at 11:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > IMHO, adding an assertion in SearchCatCacheInternal(which is a most
> > generic code part within the server) with a few error context global
> > variables may not be always safe. Because we might miss using the
> > error context variables properly. Instead, we could add a comment in
> > ErrorContextCallback structure saying something like, "it's not
> > recommended to access any system catalogues within an error context
> > callback when the callback is expected to be called while processing
> > an error, because the transaction might have been broken in that
> > case." And let the future callback developers take care of it.
> >
> > Thoughts?
>
> That sounds good to me. But I still also see the value to add an
> assertion in SearchCatCacheInternal(). If we had it, we could find
> these two bugs earlier.
>
> Anyway, this seems to be unrelated to this bug fixing so we can start
> a new thread for that.

+1 to start a new thread for that.

> > As I said earlier [1], currently only two(there could be more) error
> > context callbacks access the sys catalogues, one is in
> > slot_store_error_callback which will be fixed with your patch. Another
> > is in conversion_error_callback, we can also fix this by storing the
> > relname, attname and other required info in ConversionLocation,
> > something like the attached. Please have a look.
> >
> > Thoughts?
>
> Thank you for the patch!
>
> Here are some comments:

Thanks for the review comments.

> +static void set_error_callback_info(ConversionLocation *errpos,
> +                                          Relation rel, int cur_attno,
> +                                          ForeignScanState *fsstate);
>
> I'm concerned a bit that the function name is generic. How about
> set_conversion_error_callback_info() or something to make the name
> clear?

Done.

> ---
> +static void
> +conversion_error_callback(void *arg)
> +{
> +       ConversionLocation *errpos = (ConversionLocation *) arg;
> +
> +       if (errpos->show_generic_message)
> +               errcontext("processing expression at position %d in select list",
> +                                       errpos->cur_attno);
> +
>
> I think we can set this generic message to the error context when
> errpos->relname is NULL instead of using show_generic_message.

Right. Done.

> ---
> +       /*
> +        * Set error context callback info, so that we could avoid accessing
> +        * the system catalogues while processing the error in
> +        * conversion_error_callback. System catalogue accesses are not safe in
> +        * error context callbacks because the transaction might have been
> +        * broken by then.
> +        */
> +       set_error_callback_info(&errpos, rel, i, fsstate);
>
> Looking at other code, we use "catalog" rather than "catalogue" in
> most places. Is it better to use "catalog" for consistency? FYI, the
> "catalogue" is used at only three places in the server code:

Changed it to "catalog".

> FYI I've added those bug fixes to the next Commitfest - https://commitfest.postgresql.org/32/2955/

Thanks. I'm attaching two patches to make reviewing easier and so that
the cfbot gets a chance to test them.

0001 - for avoiding system catalog access in slot_store_error_callback.
0002 - for avoiding system catalog access in conversion_error_callback.

Please review them further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: logical replication worker accesses catalogs in error context callback

From

Zhihong Yu

Date:

27 January 2021, 05:19:44

For 0002, a few comments on the description:

bq. Avoid accessing system catalogues inside conversion_error_callback

catalogues -> catalog

bq. error context callback, because the the entire transaction might

There is redundant 'the'

bq. Since get_attname() and get_rel_name() could search the sys cache by
store the required information

store -> storing

Cheers

Re: logical replication worker accesses catalogs in error context callback

From

Bharath Rupireddy

Date:

27 January 2021, 07:07:54

On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu <zyu@yugabyte.com> wrote:
> For 0002, a few comments on the description:
>
> bq. Avoid accessing system catalogues inside conversion_error_callback
>
> catalogues -> catalog
>
> bq. error context callback, because the the entire transaction might
>
> There is redundant 'the'
>
> bq. Since get_attname() and get_rel_name() could search the sys cache by
> store the required information
>
> store -> storing

Thanks for pointing to the changes in the commit message. I corrected
them. Attaching v4 patch set, consider it for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

On Wed, Mar 17, 2021 at 4:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Mar 16, 2021 at 2:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Thanks for pointing to the changes in the commit message. I corrected
> > > them. Attaching v4 patch set, consider it for further review.
> >
> > I took a quick look at this.  I'm quite worried about the potential
> > performance cost of the v4-0001 patch (the one for fixing
> > slot_store_error_callback).  Previously, we didn't pay any noticeable
> > cost for having the callback unless there actually was an error.
> > As patched, we perform several catalog lookups per column per row,
> > even in the non-error code path.  That seems like it'd be a noticeable
> > performance hit.  Just to add insult to injury, it leaks memory.
> >
> > I propose a more radical but simpler solution: let's just not bother
> > with including the type names in the context message.  How much are
> > they really buying?
>
> Thanks. In that case, the message can only return the local and remote
> columns names and ignore the types (something like below). And the
> user will have to figure out what are the types of those columns in
> local and remote separately in case of error. Then the function
> logicalrep_typmap_gettypname can also be removed. I'm not sure if this
> is okay. Thoughts?

Hi Tom,

As suggested earlier, I'm attaching a v5 patch that avoids printing
the column type names in the context message, so no cache lookups
have to be done in the error context callback. I think the column name
is enough to know on which column the error occurred, and if required
the user can find out its type. This patch gets rid of printing the
local and remote type names in slot_store_error_callback and also
removes logicalrep_typmap_gettypname because it becomes unnecessary.
I'm not sure if this solution is acceptable. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

v5-0001-Avoid-Catalogue-Accesses-In-slot_store_error_call.patch

Re: logical replication worker accesses catalogs in error context callback

From

Bharath Rupireddy

Date:

16 April 2021, 12:53:14

On Tue, Mar 16, 2021 at 2:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> > Thanks for pointing to the changes in the commit message. I corrected
> > them. Attaching v4 patch set, consider it for further review.
>
> I took a quick look at this.  I'm quite worried about the potential
> performance cost of the v4-0001 patch (the one for fixing
> slot_store_error_callback).  Previously, we didn't pay any noticeable
> cost for having the callback unless there actually was an error.
> As patched, we perform several catalog lookups per column per row,
> even in the non-error code path.  That seems like it'd be a noticeable
> performance hit.  Just to add insult to injury, it leaks memory.
>
> I propose a more radical but simpler solution: let's just not bother
> with including the type names in the context message.  How much are
> they really buying?

<< Attaching v5-0001 here again for completeness >>
I'm attaching the v5-0001 patch, which avoids printing the column type
names in the context message, so no cache lookups have to be done in the
error context callback. I think the column name is enough to know on
which column the error occurred, and if required the user can find out
its type. This patch gets rid of printing the local and remote type
names in slot_store_error_callback and also removes
logicalrep_typmap_gettypname because it becomes unnecessary. Thoughts?
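
As a rough illustration of a callback that needs no lookups at all, the sketch below builds the context line purely from strings that are already in memory. SlotErrSketch and format_slot_store_context() are hypothetical names, and snprintf() stands in for errcontext(); the real patch may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the callback argument. */
typedef struct SlotErrSketch
{
    const char *nspname;        /* remote schema name */
    const char *relname;        /* remote relation name */
    const char **attnames;      /* remote column names */
    int         remote_attnum;  /* -1 when no column is being processed */
} SlotErrSketch;

/* Build the context line using only data that is already in memory. */
static void
format_slot_store_context(const SlotErrSketch *errarg,
                          char *buf, size_t buflen)
{
    assert(errarg != NULL && buf != NULL && buflen > 0);

    /* Nothing to report if the remote attribute number is not set. */
    if (errarg->remote_attnum < 0)
    {
        buf[0] = '\0';
        return;
    }

    snprintf(buf, buflen,
             "processing remote data for replication target relation "
             "\"%s.%s\" column \"%s\"",
             errarg->nspname, errarg->relname,
             errarg->attnames[errarg->remote_attnum]);
}
```

Because nothing here touches the type map or the syscache, the callback stays safe even if the transaction is already broken, and the non-error path pays no extra cost.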

> v4-0002 (for postgres_fdw's conversion_error_callback) has the same
> problems, although mitigated a bit by not needing to do any catalog
> lookups in the non-join case.  For the join case, I wonder whether
> we could push the lookups back to plan setup time, so they don't
> need to be done over again for each row.  (Not doing lookups at all
> doesn't seem attractive there; it'd render the context message nearly
> useless.)  A different idea is to try to get the column names out
> of the query's rangetable instead of going to the catalogs.

I'm attaching v5-0002, which stores the required attribute information
for foreign joins in postgresBeginForeignScan. That is a one-time job,
as opposed to the per-row system catalog lookups v4-0002 was doing. I'm
storing the foreign table relid (as key), relname, and the retrieved
attributes' attnum and attname in a hash table. Whenever a conversion
error occurs, the hash table is looked up by relid to fetch the relname
and attname. Thoughts?
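
The idea can be sketched as below. To keep the example self-contained it uses a small fixed-size array with a linear search instead of a real hash table, and all names (ConvErrCache, cache_add_rel, cache_add_att, cache_lookup_att) are hypothetical; a real implementation would also have to copy the strings into memory that outlives the scan setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_RELS 8
#define MAX_ATTS 16

/* Names collected once at scan setup, keyed by relation id. */
typedef struct RelAttrInfo
{
    unsigned int relid;
    const char *relname;
    int         natts;
    int         attnums[MAX_ATTS];
    const char *attnames[MAX_ATTS];
} RelAttrInfo;

typedef struct ConvErrCache
{
    int         nrels;
    RelAttrInfo rels[MAX_RELS];
} ConvErrCache;

/* Register a relation; returns NULL when the table is full. */
static RelAttrInfo *
cache_add_rel(ConvErrCache *cache, unsigned int relid, const char *relname)
{
    RelAttrInfo *rel;

    if (cache->nrels >= MAX_RELS)
        return NULL;
    rel = &cache->rels[cache->nrels++];
    rel->relid = relid;
    rel->relname = relname;
    rel->natts = 0;
    return rel;
}

/* Register one retrieved attribute of a relation. */
static bool
cache_add_att(RelAttrInfo *rel, int attnum, const char *attname)
{
    if (rel->natts >= MAX_ATTS)
        return false;
    rel->attnums[rel->natts] = attnum;
    rel->attnames[rel->natts] = attname;
    rel->natts++;
    return true;
}

/* Error-time lookup: pure memory access, no catalog involved. */
static const char *
cache_lookup_att(const ConvErrCache *cache, unsigned int relid,
                 int attnum, const char **relname_out)
{
    for (int i = 0; i < cache->nrels; i++)
    {
        const RelAttrInfo *rel = &cache->rels[i];

        if (rel->relid != relid)
            continue;
        *relname_out = rel->relname;
        for (int j = 0; j < rel->natts; j++)
        {
            if (rel->attnums[j] == attnum)
                return rel->attnames[j];
        }
        return NULL;
    }
    return NULL;
}
```

The population step runs once per scan, so the per-row path does no lookups unless an error actually occurs.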

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: logical replication worker accesses catalogs in error context callback

From

Tom Lane

Date:

06 July 2021, 19:37:38

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> How about making the below else if statement and the attname
> assignment into a single line? They are falling below the 80 char
> limit.
>         else if (colno > 0 && colno <= list_length(rte->eref->colnames))
>             attname = strVal(list_nth(rte->eref->colnames, colno - 1));

Pushed that way.  (I think possibly I'd had this code further indented
in its first incarnation, thus the extra line breaks.)

            regards, tom lane