Thread: BUG #18545: \dt breaks transaction, calling error when executed in SET SESSION AUTHORIZATION

The following bug has been logged on the website:

Bug reference:      18545
Logged by:          Andrey Rachitskiy
Email address:      therealgofman@mail.ru
PostgreSQL version: 16.3
Operating system:   Debian 12
Description:

\dt breaks transaction, calling error when executed in SET SESSION
AUTHORIZATION. This happens in two cases, with different SET.

Case one:
postgres@debian-test:~$ psql -U postgres
psql (16.3)
Type "help" for help.

postgres=# \set VERBOSITY verbose
postgres=# BEGIN;
BEGIN
postgres=*# CREATE USER regress_priv_user8;
CREATE ROLE
postgres=*# SET SESSION AUTHORIZATION regress_priv_user8;
SET
postgres=*> \dt+;
Did not find any relations.
postgres=*> SET LOCAL debug_parallel_query = 1;
SET
postgres=*> \dt+;
ERROR:  22023: role "regress_priv_user8" does not exist
CONTEXT:  while setting parameter "session_authorization" to
"regress_priv_user8"
parallel worker
LOCATION:  call_string_check_hook, guc.c:6734
postgres=!# 
\q

Case two:
postgres@debian-test:~$ psql -U postgres
psql (16.3)
Type "help" for help.

postgres=# \set VERBOSITY verbose
postgres=# BEGIN;
BEGIN
postgres=*# CREATE USER regress_priv_user8;
CREATE ROLE
postgres=*# SET SESSION AUTHORIZATION regress_priv_user8;
SET
postgres=*> \dt+
Did not find any relations.
postgres=*> set local parallel_setup_cost = 0;
SET
postgres=*> set local min_parallel_table_scan_size = 0;
SET
postgres=*> \dt+
ERROR:  22023: role "regress_priv_user8" does not exist
CONTEXT:  while setting parameter "session_authorization" to
"regress_priv_user8"
parallel worker
LOCATION:  call_string_check_hook, guc.c:6734
postgres=!# 
\q


PG Bug reporting form <noreply@postgresql.org> writes:
> postgres=# BEGIN;
> BEGIN
> postgres=*# CREATE USER regress_priv_user8;
> CREATE ROLE
> postgres=*# SET SESSION AUTHORIZATION regress_priv_user8;
> SET
> postgres=*> SET LOCAL debug_parallel_query = 1;
> SET
> postgres=*> \dt+;
> ERROR:  22023: role "regress_priv_user8" does not exist
> CONTEXT:  while setting parameter "session_authorization" to "regress_priv_user8"
> parallel worker

So this has exactly nothing to do with \dt+; any parallel query
will hit it.  The problem is that parallel workers do
RestoreGUCState() before they've restored the leader's snapshot.
Thus, in this example where session_authorization refers to an
uncommitted pg_authid entry, the workers don't see that entry.
It seems likely that similar failures are possible with other
GUCs that perform catalog lookups.

I experimented with two different ways to fix this:

1.  Run RestoreGUCState() outside a transaction, thus preventing
catalog lookups.  Assume that individual GUC check hooks that
would wish to do a catalog lookup will cope.  Unfortunately,
some of them don't and would need fixed; check_role and
check_session_authorization for two.

2.  Delay RestoreGUCState() into the parallel worker's main
transaction, after we've restored the leader's snapshot.
This turns out to break a different set of check hooks, notably
check_transaction_deferrable.

I think that the blast radius of option 2 is probably smaller than
option 1's, because it should only matter to check hooks that think
they should run before the transaction has set a snapshot, and there
are few of those.  check_transaction_read_only already had a guard,
but I added similar ones to check_transaction_isolation and
check_transaction_deferrable.

The attached draft patch also contains changes to prevent
check_session_authorization from doing anything during parallel
worker startup.  That's left over from experimenting with option 1,
and is not strictly necessary with option 2.  I left it in anyway
because it's saving some unnecessary work.  (For some reason,
check_role seems not to fail if you modify the test case to use
SET ROLE.  I did not figure out why not.  I kind of want to modify
check_role to be a no-op too when InitializingParallelWorker,
but did not touch that here pending more investigation.)

Another thing I'm wondering about is whether to postpone
RestoreLibraryState similarly.  Its current placement is said
to be "before restoring GUC values", so it looks a little out
of place now.  Moving it into the main transaction would save
one StartTransactionCommand/CommitTransactionCommand pair
during parallel worker start, which is worth something.
But I think the real argument for it is that if any loaded
libraries try to do catalog lookups during load, we'd rather
that they see the same catalog state the leader does.
As against that, it feels like there's a nonzero risk of
breaking some third-party code if we move that call.

Thoughts?

            regards, tom lane

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb5..06331d906a 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1410,10 +1410,6 @@ ParallelWorkerMain(Datum main_arg)
     libraryspace = shm_toc_lookup(toc, PARALLEL_KEY_LIBRARY, false);
     StartTransactionCommand();
     RestoreLibraryState(libraryspace);
-
-    /* Restore GUC values from launching backend. */
-    gucspace = shm_toc_lookup(toc, PARALLEL_KEY_GUC, false);
-    RestoreGUCState(gucspace);
     CommitTransactionCommand();

     /* Crank up a transaction state appropriate to a parallel worker. */
@@ -1455,6 +1451,14 @@ ParallelWorkerMain(Datum main_arg)
      */
     InvalidateSystemCaches();

+    /*
+     * Restore GUC values from launching backend.  We can't do this earlier,
+     * because GUCs that do catalog lookups (such as session_authorization)
+     * need to see the same database state as the leader.
+     */
+    gucspace = shm_toc_lookup(toc, PARALLEL_KEY_GUC, false);
+    RestoreGUCState(gucspace);
+
     /*
      * Restore current role id.  Skip verifying whether session user is
      * allowed to become this role and blindly restore the leader's state for
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 9345131711..b2a00ae0f0 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -577,14 +577,16 @@ check_transaction_read_only(bool *newval, void **extra, GucSource source)
  * We allow idempotent changes at any time, but otherwise this can only be
  * changed in a toplevel transaction that has not yet taken a snapshot.
  *
- * As in check_transaction_read_only, allow it if not inside a transaction.
+ * As in check_transaction_read_only, allow it if not inside a transaction,
+ * or if restoring state in a parallel worker.
  */
 bool
 check_transaction_isolation(int *newval, void **extra, GucSource source)
 {
     int            newXactIsoLevel = *newval;

-    if (newXactIsoLevel != XactIsoLevel && IsTransactionState())
+    if (newXactIsoLevel != XactIsoLevel &&
+        IsTransactionState() && !InitializingParallelWorker)
     {
         if (FirstSnapshotSet)
         {
@@ -619,6 +621,10 @@ check_transaction_isolation(int *newval, void **extra, GucSource source)
 bool
 check_transaction_deferrable(bool *newval, void **extra, GucSource source)
 {
+    /* Just accept the value when restoring state in a parallel worker */
+    if (InitializingParallelWorker)
+        return true;
+
     if (IsSubTransaction())
     {
         GUC_check_errcode(ERRCODE_ACTIVE_SQL_TRANSACTION);
@@ -811,6 +817,15 @@ check_session_authorization(char **newval, void **extra, GucSource source)
     if (*newval == NULL)
         return true;

+    /*
+     * Do nothing if initializing a parallel worker: ParallelWorkerMain is
+     * responsible for setting the active role.  We just need to absorb the
+     * leader's value of session_authorization in case some user code looks at
+     * it.
+     */
+    if (InitializingParallelWorker)
+        return true;
+
     if (!IsTransactionState())
     {
         /*
@@ -886,7 +901,7 @@ assign_session_authorization(const char *newval, void *extra)
 {
     role_auth_extra *myextra = (role_auth_extra *) extra;

-    /* Do nothing for the boot_val default of NULL */
+    /* Do nothing for the boot_val default of NULL, and in parallel workers */
     if (!myextra)
         return;


Hi, Tom! Thank you for work on the subject. After applying patch, problem is no longer reproducible.
 
---
Best regards,
Andrey Rachitskiy
Postgres Professional: http://postgrespro.com
 
Суббота, 20 июля 2024, 0:04 +05:00 от Tom Lane <tgl@sss.pgh.pa.us>:
 
PG Bug reporting form <noreply@postgresql.org> writes:
> postgres=# BEGIN;
> BEGIN
> postgres=*# CREATE USER regress_priv_user8;
> CREATE ROLE
> postgres=*# SET SESSION AUTHORIZATION regress_priv_user8;
> SET
> postgres=*> SET LOCAL debug_parallel_query = 1;
> SET
> postgres=*> \dt+;
> ERROR: 22023: role "regress_priv_user8" does not exist
> CONTEXT: while setting parameter "session_authorization" to "regress_priv_user8"
> parallel worker

So this has exactly nothing to do with \dt+; any parallel query
will hit it. The problem is that parallel workers do
RestoreGUCState() before they've restored the leader's snapshot.
Thus, in this example where session_authorization refers to an
uncommitted pg_authid entry, the workers don't see that entry.
It seems likely that similar failures are possible with other
GUCs that perform catalog lookups.

I experimented with two different ways to fix this:

1. Run RestoreGUCState() outside a transaction, thus preventing
catalog lookups. Assume that individual GUC check hooks that
would wish to do a catalog lookup will cope. Unfortunately,
some of them don't and would need fixed; check_role and
check_session_authorization for two.

2. Delay RestoreGUCState() into the parallel worker's main
transaction, after we've restored the leader's snapshot.
This turns out to break a different set of check hooks, notably
check_transaction_deferrable.

I think that the blast radius of option 2 is probably smaller than
option 1's, because it should only matter to check hooks that think
they should run before the transaction has set a snapshot, and there
are few of those. check_transaction_read_only already had a guard,
but I added similar ones to check_transaction_isolation and
check_transaction_deferrable.

The attached draft patch also contains changes to prevent
check_session_authorization from doing anything during parallel
worker startup. That's left over from experimenting with option 1,
and is not strictly necessary with option 2. I left it in anyway
because it's saving some unnecessary work. (For some reason,
check_role seems not to fail if you modify the test case to use
SET ROLE. I did not figure out why not. I kind of want to modify
check_role to be a no-op too when InitializingParallelWorker,
but did not touch that here pending more investigation.)

Another thing I'm wondering about is whether to postpone
RestoreLibraryState similarly. Its current placement is said
to be "before restoring GUC values", so it looks a little out
of place now. Moving it into the main transaction would save
one StartTransactionCommand/CommitTransactionCommand pair
during parallel worker start, which is worth something.
But I think the real argument for it is that if any loaded
libraries try to do catalog lookups during load, we'd rather
that they see the same catalog state the leader does.
As against that, it feels like there's a nonzero risk of
breaking some third-party code if we move that call.

Thoughts?

regards, tom lane
 
 
=?UTF-8?B?0JDQvdC00YDQtdC5INCg0LDRh9C40YbQutC40Lk=?= <therealgofman@mail.ru> writes:
> Hi, Tom! Thank you for work on the subject.  After applying patch, problem is no longer reproducible.

Thanks for checking.  I realized that the idea of making
check_session_authorization a no-op was wrong as presented:
if we don't set up an "extra" struct then guc.c would be
unable to restore the setting later, in case say a function
that's run inside the parallel query has a SET
session_authorization clause.  It's probably possible to
revive that idea with more work, but it's not essential to the
bug fix and we're getting close to the August minor releases.
So I pushed the core bug fix, and I'll take another look at
that part later.

            regards, tom lane



I wrote:
> So I pushed the core bug fix, and I'll take another look at
> that part later.

... or not; the buildfarm didn't like that much.  It'll
have to wait till after these releases, because I'm
overdue to get to work on the release notes.

            regards, tom lane



I wrote:
>> So I pushed the core bug fix, and I'll take another look at
>> that part later.

> ... or not; the buildfarm didn't like that much.  It'll
> have to wait till after these releases, because I'm
> overdue to get to work on the release notes.

The reason the buildfarm found something I'd missed in testing
is that I didn't run check-world with debug_parallel_query set,
which was a bad idea for a patch messing with parallel query
mechanics :-(.  Mea culpa.

However, what the farm found is that assign_client_encoding
is flat-out broken.  It's ignoring the first commandment
for GUC hooks, which is "Thy assign hooks shalt not fail".
(If we didn't need that, there wouldn't be a separation
between check hooks and assign hooks in the first place.)

Because it's throwing an error at the wrong time, it spits
up if it sees an (irrelevant) rollback of client_encoding
during cleanup of a failed parallel worker.  We didn't see
this before because GUCRestoreState was being run in a separate
mini-transaction that (usually at least) doesn't fail.

The attached correction basically just moves that test into
check_client_encoding where it should have been to begin with.
After applying this, I can un-revert f5f30c22e and everything
passes.

However, this episode definitely gives me pause about back-patching
f5f30c22e as I did before.  It seems not impossible that there
are extensions with similarly mis-coded assign hooks, and if
so those are going to need to be fixed.  (I did check that none
of the other core GUCs have this problem.  assign_recovery_target
and friends would, except that they are for PGC_POSTMASTER variables
that won't be getting changed by GUCRestoreState.  Anyway they're a
known kluge, and this patch isn't making it worse.)  So we'd better
treat this as a minor API change, which means we probably shouldn't
put it in stable branches.

What I'm currently thinking is to apply in HEAD and perhaps v17,
but not further back.  Given that this bug has existed since
the beginning of parallel query yet wasn't reported till now,
it's not sufficiently problematic to take any risk for in
stable branches.

Any opinions about whether it's too late to do this in v17?
Post-beta3 is pretty late, for sure, but maybe we could get
away with it.  And we are fixing a bug here.

            regards, tom lane

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index f44d942aa4..6ba6d08b24 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -690,6 +690,24 @@ check_client_encoding(char **newval, void **extra, GucSource source)
     /* Get the canonical name (no aliases, uniform case) */
     canonical_name = pg_encoding_to_char(encoding);

+    /*
+     * Parallel workers send data to the leader, not the client.  They always
+     * send data using the database encoding; therefore, we should never
+     * actually change the client encoding in a parallel worker.  However,
+     * during parallel worker startup, we want to accept the leader's
+     * client_encoding setting so that anyone who looks at the value in the
+     * worker sees the same value that they would see in the leader.  A change
+     * other than during startup, for example due to a SET clause attached to
+     * a function definition, should be rejected, as there is nothing we can
+     * do inside the worker to make it take effect.
+     */
+    if (IsParallelWorker() && !InitializingParallelWorker)
+    {
+        GUC_check_errcode(ERRCODE_INVALID_TRANSACTION_STATE);
+        GUC_check_errdetail("Cannot change \"client_encoding\" during a parallel operation.");
+        return false;
+    }
+
     /*
      * If we are not within a transaction then PrepareClientEncoding will not
      * be able to look up the necessary conversion procs.  If we are still
@@ -700,11 +718,15 @@ check_client_encoding(char **newval, void **extra, GucSource source)
      * It seems like a bad idea for client_encoding to change that way anyhow,
      * so we don't go out of our way to support it.
      *
+     * In a parallel worker, we might as well skip PrepareClientEncoding since
+     * we're not going to use its results.
+     *
      * Note: in the postmaster, or any other process that never calls
      * InitializeClientEncoding, PrepareClientEncoding will always succeed,
      * and so will SetClientEncoding; but they won't do anything, which is OK.
      */
-    if (PrepareClientEncoding(encoding) < 0)
+    if (!IsParallelWorker() &&
+        PrepareClientEncoding(encoding) < 0)
     {
         if (IsTransactionState())
         {
@@ -758,28 +780,11 @@ assign_client_encoding(const char *newval, void *extra)
     int            encoding = *((int *) extra);

     /*
-     * Parallel workers send data to the leader, not the client.  They always
-     * send data using the database encoding.
+     * In a parallel worker, we never override the client encoding that was
+     * set by ParallelWorkerMain().
      */
     if (IsParallelWorker())
-    {
-        /*
-         * During parallel worker startup, we want to accept the leader's
-         * client_encoding setting so that anyone who looks at the value in
-         * the worker sees the same value that they would see in the leader.
-         */
-        if (InitializingParallelWorker)
-            return;
-
-        /*
-         * A change other than during startup, for example due to a SET clause
-         * attached to a function definition, should be rejected, as there is
-         * nothing we can do inside the worker to make it take effect.
-         */
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-                 errmsg("cannot change \"client_encoding\" during a parallel operation")));
-    }
+        return;

     /* We do not expect an error if PrepareClientEncoding succeeded */
     if (SetClientEncoding(encoding) < 0)

On Sun, Aug 4, 2024 at 3:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Any opinions about whether it's too late to do this in v17?
Post-beta3 is pretty late, for sure, but maybe we could get
away with it.  And we are fixing a bug here.


If this isn't going to appear in the beta3 build I'd say it's probably too late given the target audience for waiting on this is extension authors.  If it is going into beta3 then I'd vote to allow it.

Feels like this dynamic should be covered as part of our recent attempt to better communicate our policies to our extension authoring community.


Something like:

Namely, beta releases do not constitute a minor release under our policies and updates to our API/ABIs can happen during beta at any point - whether for features newly added in the under- development major version or not.  Extension authors are thus encouraged to test their extensions against the RC build at minimum should they wish for their extension to be ready when the initial release comes out.  Tests against beta versions are very helpful to all interested parties but there is no guarantee that tests that pass any given beta release will pass when performed against the release candidate.  For the release candidate we will use the same patching policy as for a normal minor release.  Any exceptions will necessitate a second release candidate.

The above wording allows us to put this patch into beta3, which I'd be fine with.  But I'd also be fine with adding wording like: "Changes introduced after the final beta is released for testing will [generally?] be limited to fixing items conforming to the Open Item policy."  Probably favor the latter too by a small margin.

David J.


"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Sun, Aug 4, 2024 at 3:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Any opinions about whether it's too late to do this in v17?
>> Post-beta3 is pretty late, for sure, but maybe we could get
>> away with it.  And we are fixing a bug here.

> If this isn't going to appear in the beta3 build I'd say it's probably too
> late given the target audience for waiting on this is extension authors.
> If it is going into beta3 then I'd vote to allow it.

Nope, it's definitely not going into beta3; it's about two days
too late for that.

I agree fixing it in HEAD only is the more conservative course.
To do otherwise, we'd have to rank the #18545 bug as fairly
important, and I'm not sure I buy that given how long it took to
notice it.  But I was curious to see if anyone felt differently.

            regards, tom lane



On Sun, Aug 4, 2024 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, what the farm found is that assign_client_encoding
> is flat-out broken.  It's ignoring the first commandment
> for GUC hooks, which is "Thy assign hooks shalt not fail".
> (If we didn't need that, there wouldn't be a separation
> between check hooks and assign hooks in the first place.)

Interesting. Looks like my mistake, dating to
10c0558ffefcd12bf1d3dc35587eba41d1ce4571. I'm honestly kind of
surprised that nobody discovered this problem for 8 years. I would
have expected it to cause more problems.

--
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Aug 4, 2024 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, what the farm found is that assign_client_encoding
>> is flat-out broken.  It's ignoring the first commandment
>> for GUC hooks, which is "Thy assign hooks shalt not fail".

> Interesting. Looks like my mistake, dating to
> 10c0558ffefcd12bf1d3dc35587eba41d1ce4571. I'm honestly kind of
> surprised that nobody discovered this problem for 8 years. I would
> have expected it to cause more problems.

Yeah, it's a bit accidental that that's not reachable up to now.
Or I think it's not reachable, anyway.  If we find out differently
we can back-patch 0ae5b763e, but for now I refrained.

            regards, tom lane