Thread: catalog access with reset GUCs during parallel worker startup

catalog access with reset GUCs during parallel worker startup

From
Andres Freund
Date:
Hi,

I noticed that during ParallelWorkerMain()->RestoreGUCState() we perform
catalog accesses with GUCs being set to wrong values, specifically values that
aren't what's in postgresql.conf, and that haven't been set in the leader.

The reason for this is that

1) RestoreGUCState() is called in a transaction, which signals to several GUC
hooks that they can do catalog access

2) RestoreGUCState() resets all non-skipped GUC values to their "boot_val"
first and then in another pass sets all GUCs that were non-default in the
leader.


This e.g. can lead to parallel worker startup using the wrong fsync or
wal_sync_method value when catalog accesses trigger WAL writes. Luckily
PGC_POSTMASTER GUCs are skipped, otherwise this'd be kinda bad. But it still
doesn't seem good.

Do we really need to reset GUCs to their boot value? Particularly for
PGC_SIGHUP variables I don't think we can do that if we want to do
RestoreGUCState() in a transaction.  It'd obviously involve a bit more code,
but it seems entirely possible to just get rid of the reset phase?

Greetings,

Andres Freund



Re: catalog access with reset GUCs during parallel worker startup

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Do we really need to reset GUCs to their boot value? Particularly for
> PGC_SIGHUP variables I don't think we can do that if we want to do
> RestoreGUCState() in a transaction.  It'd obviously involve a bit more code,
> but it seems entirely possible to just get rid of the reset phase?

In an EXEC_BACKEND build, they're going to start out with those
values anyway.  If you're not happy about the consequences,
"skipping the reset" is not the way to improve matters.

Can we arrange to absorb the leader's values before starting the
worker's transaction, instead of inside it?

            regards, tom lane



Re: catalog access with reset GUCs during parallel worker startup

From
Andres Freund
Date:
Hi,

On 2022-02-09 18:56:41 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Do we really need to reset GUCs to their boot value? Particularly for
> > PGC_SIGHUP variables I don't think we can do that if we want to do
> > RestoreGUCState() in a transaction.  It'd obviously involve a bit more code,
> > but it seems entirely possible to just get rid of the reset phase?
> 
> In an EXEC_BACKEND build, they're going to start out with those
> values anyway.  If you're not happy about the consequences,
> "skipping the reset" is not the way to improve matters.

Postmaster's GUC state will already be loaded via read_nondefault_variables(),
much earlier in startup. Well before bgworkers get control and before
transaction environment is up.

Setting a watchpoint on enableFsync, in a parallel worker where postmaster
runs with enableFsync=off, shows the following:

Old value = true
New value = false
set_config_option (name=0x7f42bd9ec110 "fsync", value=0x7f42bd9ec000 "false", context=PGC_POSTMASTER,
source=PGC_S_ARGV,action=GUC_ACTION_SET, 
 
    changeVal=true, elevel=21, is_reload=true) at /home/andres/src/postgresql/src/backend/utils/misc/guc.c:7760
7760                        set_extra_field(&conf->gen, &conf->gen.extra,
(rr) bt
#0  set_config_option (name=0x7f42bd9ec110 "fsync", value=0x7f42bd9ec000 "false", context=PGC_POSTMASTER,
source=PGC_S_ARGV,action=GUC_ACTION_SET, 
 
    changeVal=true, elevel=21, is_reload=true) at /home/andres/src/postgresql/src/backend/utils/misc/guc.c:7760
#1  0x00007f42bcd5f178 in read_nondefault_variables () at
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:10635
#2  0x00007f42bca9ccd1 in SubPostmasterMain (argc=3, argv=0x7f42bd9c93c0) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5086
#3  0x00007f42bc98bcef in main (argc=3, argv=0x7f42bd9c93c0) at
/home/andres/src/postgresql/src/backend/main/main.c:194

Old value = false
New value = true
InitializeOneGUCOption (gconf=0x7f42bd0a32c0 <ConfigureNamesBool+4320>) at
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:5900
5900                    conf->gen.extra = conf->reset_extra = extra;
(rr) bt
#0  InitializeOneGUCOption (gconf=0x7f42bd0a32c0 <ConfigureNamesBool+4320>) at
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:5900
#1  0x00007f42bcd5ff95 in RestoreGUCState (gucstate=0x7f42bc4f1d00) at
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:11135
#2  0x00007f42bc720873 in ParallelWorkerMain (main_arg=545066472) at
/home/andres/src/postgresql/src/backend/access/transam/parallel.c:1408
#3  0x00007f42bca884e5 in StartBackgroundWorker () at
/home/andres/src/postgresql/src/backend/postmaster/bgworker.c:858
#4  0x00007f42bca9cfd1 in SubPostmasterMain (argc=3, argv=0x7f42bd9c93c0) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5230
#5  0x00007f42bc98bcef in main (argc=3, argv=0x7f42bd9c93c0) at
/home/andres/src/postgresql/src/backend/main/main.c:194


Old value = true
New value = false
set_config_option (name=0x7f42bc4f1e04 "fsync", value=0x7f42bc4f1e0a "false", context=PGC_POSTMASTER,
source=PGC_S_ARGV,action=GUC_ACTION_SET, 
 
    changeVal=true, elevel=21, is_reload=true) at /home/andres/src/postgresql/src/backend/utils/misc/guc.c:7760
7760                        set_extra_field(&conf->gen, &conf->gen.extra,
(rr) bt
#0  set_config_option (name=0x7f42bc4f1e04 "fsync", value=0x7f42bc4f1e0a "false", context=PGC_POSTMASTER,
source=PGC_S_ARGV,action=GUC_ACTION_SET, 
 
    changeVal=true, elevel=21, is_reload=true) at /home/andres/src/postgresql/src/backend/utils/misc/guc.c:7760
#1  0x00007f42bcd600fc in RestoreGUCState (gucstate=0x7f42bc4f1d00) at
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:11172
#2  0x00007f42bc720873 in ParallelWorkerMain (main_arg=545066472) at
/home/andres/src/postgresql/src/backend/access/transam/parallel.c:1408
#3  0x00007f42bca884e5 in StartBackgroundWorker () at
/home/andres/src/postgresql/src/backend/postmaster/bgworker.c:858
#4  0x00007f42bca9cfd1 in SubPostmasterMain (argc=3, argv=0x7f42bd9c93c0) at
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5230
#5  0x00007f42bc98bcef in main (argc=3, argv=0x7f42bd9c93c0) at
/home/andres/src/postgresql/src/backend/main/main.c:194




> Can we arrange to absorb the leader's values before starting the
> worker's transaction, instead of inside it?

I assume Robert did it this way for a reason? It'd not be surprising if there
were a bunch of extensions assuming its safe to do catalog accesses if
IsUnderPostmaster or such :(

Greetings,

Andres Freund



Re: catalog access with reset GUCs during parallel worker startup

From
Robert Haas
Date:
On Wed, Feb 9, 2022 at 7:33 PM Andres Freund <andres@anarazel.de> wrote:
> > Can we arrange to absorb the leader's values before starting the
> > worker's transaction, instead of inside it?
>
> I assume Robert did it this way for a reason? It'd not be surprising if there
> were a bunch of extensions assuming its safe to do catalog accesses if
> IsUnderPostmaster or such :(

I remember trying that, and if I remember correctly, it broke with
core GUCs, without any need for extensions in the picture. I don't
remember the details too well, unfortunately, but I think it had to do
with the behavior of some of the check and/or assign hooks.

It's probably time to try it again, because (a) maybe things are
different now, or (b) maybe I did it wrong, and in any event (c) I
can't really remember why it didn't work and we probably want to know
that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: catalog access with reset GUCs during parallel worker startup

From
Andres Freund
Date:
Hi,

I think we need to do something about this, because afaics this has data
corruption risks, albeit with a narrow window.

E.g. consider what happens if the value of wal_sync_method is reset to the
boot value and one of the catalog accesses in GUC check hooks causes WAL
writes (e.g. due to pruning, buffer replacement). One backend doing
fdatasync() and the rest O_DSYNC or vice versa won't work lead to reliable
durability.


On 2022-02-09 21:47:09 -0500, Robert Haas wrote:
> I remember trying that, and if I remember correctly, it broke with
> core GUCs, without any need for extensions in the picture. I don't
> remember the details too well, unfortunately, but I think it had to do
> with the behavior of some of the check and/or assign hooks.
>
> It's probably time to try it again, because (a) maybe things are
> different now, or (b) maybe I did it wrong, and in any event (c) I
> can't really remember why it didn't work and we probably want to know
> that.

The tests fail:
+ERROR:  invalid value for parameter "session_authorization": "andres"
+CONTEXT:  while setting parameter "session_authorization" to "andres"
+parallel worker
+while scanning relation "public.pvactst"

but that's easily worked around. In fact, there's already code to do so, it
just is lacking awareness of session_authorization:

static bool
can_skip_gucvar(struct config_generic *gconf)
...
     * Role must be handled specially because its current value can be an
     * invalid value (for instance, if someone dropped the role since we set
     * it).  So if we tried to serialize it normally, we might get a failure.
     * We skip it here, and use another mechanism to ensure the worker has the
     * right value.

Don't those reasons apply just as well to session_authorization as they do to
role?

If I skip session_authorization in can_skip_gucvar() and move
RestoreGUCState() out of the transaction, all tests pass.

Unsurprisingly we get bogus results for parallel accesses to
session_authorization:

postgres[3312622][1]=> SELECT current_setting('session_authorization');
┌─────────────────┐
│ current_setting │
├─────────────────┤
│ frak            │
└─────────────────┘
(1 row)

postgres[3312622][1]=> SET force_parallel_mode = on;
SET
postgres[3312622][1]=> SELECT current_setting('session_authorization');
┌─────────────────┐
│ current_setting │
├─────────────────┤
│ andres          │
└─────────────────┘
(1 row)

but that could be remedied just already done for role.

Greetings,

Andres Freund



Re: catalog access with reset GUCs during parallel worker startup

From
David Rowley
Date:
On Wed, 23 Feb 2022 at 15:51, Andres Freund <andres@anarazel.de> wrote:
> The tests fail:
> +ERROR:  invalid value for parameter "session_authorization": "andres"
> +CONTEXT:  while setting parameter "session_authorization" to "andres"
> +parallel worker
> +while scanning relation "public.pvactst"
>
> but that's easily worked around. In fact, there's already code to do so, it
> just is lacking awareness of session_authorization:

I was experimenting with the attached patch which just has
check_session_authorization() return true when not in transaction and
when InitializingParallelWorker is true. It looks like there are
already some other check functions looking at the value of
InitializingParallelWorker (e.g. check_transaction_read_only(),
check_role())

With that done, I do not see any other errors when calling
RestoreGUCState() while not in transaction.  The parallel worker seems
to pick up the session_authorization GUC correctly using the same
debug_parallel_query test that you tried out.

On looking at the other check functions, I noted down the following as
ones that may behave differently when called while not in transaction:

check_transaction_read_only --- checks InitializingParallelWorker
check_transaction_deferrable -- checks
IsSubTransaction/FirstSnapshotSet, returns true if not
check_client_encoding -- Checks IsTransactionState().  Behavioural change
check_default_table_access_method -- accesses catalog if in xact.
Behavioural change
check_default_tablespace -- accesses catalog if in xact. Behavioural change
check_temp_tablespaces -- accesses catalog if in xact. Behavioural change
check_role -- Returns false if not in xact. Handled specially in
RestoreGUCState()
check_session_authorization -- Modify by patch to return true when not
in xact and parallel worker starting
check_timezone -- calls interval_in. Can't see any catalog lookup there.
check_default_text_search_config -- Checks in IsTransactionState()
check_transaction_isolation -- Check in IsTransactionState()

It seems that RestoreLibraryState() was called while in transaction so
that extension GUCs could be initialized before restoring the GUCs.
So there could be repercussions for extensions if we did this.

Does anyone see any reasons why we can't fix this with the attached?

David

Attachment

Re: catalog access with reset GUCs during parallel worker startup

From
David Rowley
Date:
On Thu, 10 Feb 2022 at 13:33, Andres Freund <andres@anarazel.de> wrote:
> Postmaster's GUC state will already be loaded via read_nondefault_variables(),
> much earlier in startup. Well before bgworkers get control and before
> transaction environment is up.
>
> Setting a watchpoint on enableFsync, in a parallel worker where postmaster
> runs with enableFsync=off, shows the following:

The following comment in RestoreGUCState() starting with "since the
leader's" seems to indicate this step is required.

> * but already have their default values.  Thus, this ends up being the
> * same test that SerializeGUCState uses, even though the sets of
> * variables involved may well be different since the leader's set of
> * variables-not-at-default-values can differ from the set that are
>  * not-default in this freshly started worker.

I'm just not quite clear on which cases this could be.   Looking at
InitializeOneGUCOption(), the newval comes from conf->boot_val, which,
for built-in GUCs just comes from the ConfigureNames* table in
guc_tables.c. Perhaps there could be variances in values that are
passed during the call to DefineCustom*Variable between the leader and
the worker in some extension's code.

David