Thread: pg_subscription.subslotname is wrongly marked NOT NULL
In all branches back to v10, initdb marks pg_subscription.subslotname as NOT NULL: # \d pg_subscription Table "pg_catalog.pg_subscription" Column | Type | Collation | Nullable | Default -----------------+---------+-----------+----------+--------- oid | oid | | not null | subdbid | oid | | not null | subname | name | | not null | subowner | oid | | not null | subenabled | boolean | | not null | subbinary | boolean | | not null | subconninfo | text | C | not null | subslotname | name | | not null | subsynccommit | text | C | not null | subpublications | text[] | C | not null | Nonetheless, CREATE/ALTER SUBSCRIPTION blithely set it to null when slot_name = NONE is specified. This apparently causes few ill effects, unless somebody decides to JIT-compile deconstruction of pg_subscription tuples. Which is why all of Andres' JIT-enabled buildfarm animals are unhappy with 9de77b545 --- quite unintentionally, that commit added a test case that exposed the problem. What would we like to do about this? Removing the NOT NULL marking wouldn't be too hard in HEAD, but telling users to fix it manually in the back branches seems like a mess. On the whole it seems like changing the code to use some other representation of slot_name = NONE, like say an empty string, would be less of a mess. It's also a bit annoying that we have no mechanized checks that would catch this inconsistency. If JIT is going to be absolutely dependent on NOT NULL markings being accurate, we can't really have such a laissez-faire attitude to C code getting it wrong. Thoughts? regards, tom lane
I wrote: > In all branches back to v10, initdb marks pg_subscription.subslotname > as NOT NULL: ... > Nonetheless, CREATE/ALTER SUBSCRIPTION blithely set it to null > when slot_name = NONE is specified. > What would we like to do about this? Removing the NOT NULL > marking wouldn't be too hard in HEAD, but telling users to > fix it manually in the back branches seems like a mess. After further thought, it seems like changing the definition that subslotname is null for "NONE" is unworkable, because client-side code might be depending on that. (pg_dump certainly is; we could change that, but other code might have the same expectation.) What I propose we do is (1) Fix the NOT NULL marking in HEAD and v13. We could perhaps alter it in older branches as well, but we cannot force initdb so such a change would only affect newly-initdb'd installations. (2) In pre-v13 branches, hack the JIT tuple deconstruction code to be specifically aware that it can't trust attnotnull for pg_subscription.subslotname. Yeah, it's ugly, but at least it's not ugly going forwards. I haven't looked to see where or how we might do (2), but I assume it's possible. > It's also a bit annoying that we have no mechanized checks that > would catch this inconsistency. If JIT is going to be absolutely > dependent on NOT NULL markings being accurate, we can't really > have such a laissez-faire attitude to C code getting it wrong. It seems like at least in assert-enabled builds, we'd better have a cross-check for that. I'm not sure where's the best place. regards, tom lane
I wrote: > (2) In pre-v13 branches, hack the JIT tuple deconstruction code > to be specifically aware that it can't trust attnotnull for > pg_subscription.subslotname. Yeah, it's ugly, but at least it's > not ugly going forwards. Concretely, as attached for v12. regards, tom lane diff --git a/src/backend/jit/llvm/llvmjit_deform.c b/src/backend/jit/llvm/llvmjit_deform.c index 835aea83e9..4b2144e1a7 100644 --- a/src/backend/jit/llvm/llvmjit_deform.c +++ b/src/backend/jit/llvm/llvmjit_deform.c @@ -22,11 +22,24 @@ #include "access/htup_details.h" #include "access/tupdesc_details.h" +#include "catalog/pg_subscription.h" #include "executor/tuptable.h" #include "jit/llvmjit.h" #include "jit/llvmjit_emit.h" +/* + * Through an embarrassing oversight, pre-v13 installations may have + * pg_subscription.subslotname marked as attnotnull, which it should not be. + * To avoid possible crashes, use this macro instead of testing attnotnull + * directly. + */ +#define ATTNOTNULL(att) \ + ((att)->attnotnull && \ + ((att)->attrelid != SubscriptionRelationId || \ + (att)->attnum != Anum_pg_subscription_subslotname)) + + /* * Create a function that deforms a tuple of type desc up to natts columns. */ @@ -121,7 +134,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, * combination of attisdropped && attnotnull combination shouldn't * exist. */ - if (att->attnotnull && + if (ATTNOTNULL(att) && !att->atthasmissing && !att->attisdropped) guaranteed_column_number = attnum; @@ -419,7 +432,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, * into account, because if they're present the heaptuple's natts * would have indicated that a slot_getmissingattrs() is needed. */ - if (!att->attnotnull) + if (!ATTNOTNULL(att)) { LLVMBasicBlockRef b_ifnotnull; LLVMBasicBlockRef b_ifnull; @@ -586,6 +599,9 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, * to generate better code. Without that LLVM can't figure out that * the offset might be constant due to the jumps for previously * decoded columns. + * + * Note: these two tests on attnotnull don't need the ATTNOTNULL hack, + * because they are harmless on pg_subscription anyway. */ if (attguaranteedalign) { diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out index e7add9d2b8..3ba1e5dcdd 100644 --- a/src/test/regress/expected/subscription.out +++ b/src/test/regress/expected/subscription.out @@ -147,6 +147,13 @@ DROP SUBSCRIPTION regress_testsub; ERROR: DROP SUBSCRIPTION cannot run inside a transaction block COMMIT; ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE); +\dRs+ + List of subscriptions + Name | Owner | Enabled | Publication | Synchronous commit | Conninfo +-----------------+----------------------------+---------+---------------------+--------------------+------------------------------ + regress_testsub | regress_subscription_user2 | f | {testpub2,testpub3} | local | dbname=regress_doesnotexist2 +(1 row) + -- now it works BEGIN; DROP SUBSCRIPTION regress_testsub; diff --git a/src/test/regress/sql/subscription.sql b/src/test/regress/sql/subscription.sql index 9e234ab8b3..1bc58887f7 100644 --- a/src/test/regress/sql/subscription.sql +++ b/src/test/regress/sql/subscription.sql @@ -109,6 +109,8 @@ COMMIT; ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE); +\dRs+ + -- now it works BEGIN; DROP SUBSCRIPTION regress_testsub;
I wrote: >> It's also a bit annoying that we have no mechanized checks that >> would catch this inconsistency. If JIT is going to be absolutely >> dependent on NOT NULL markings being accurate, we can't really >> have such a laissez-faire attitude to C code getting it wrong. > It seems like at least in assert-enabled builds, we'd better have > a cross-check for that. I'm not sure where's the best place. I concluded that we should put this into CatalogTupleInsert and CatalogTupleUpdate. The bootstrap data path already has a check (see InsertOneNull()), and so does the executor, so we only need to worry about tuples that're built manually by catalog manipulation code. I think all of that goes through these functions. Hence, as attached. ... and apparently, I should have done this task first, because damn if it didn't immediately expose another bug of the same ilk. pg_subscription_rel.srsublsn also needs to be marked nullable. regards, tom lane diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c index d63fcf58cf..fe277f3ad3 100644 --- a/src/backend/catalog/indexing.c +++ b/src/backend/catalog/indexing.c @@ -167,6 +167,43 @@ CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple) ExecDropSingleTupleTableSlot(slot); } +/* + * Subroutine to verify that catalog constraints are honored. + * + * Tuples inserted via CatalogTupleInsert/CatalogTupleUpdate are generally + * "hand made", so that it's possible that they fail to satisfy constraints + * that would be checked if they were being inserted by the executor. That's + * a coding error, so we only bother to check for it in assert-enabled builds. + */ +#ifdef USE_ASSERT_CHECKING + +static void +CatalogTupleCheckConstraints(Relation heapRel, HeapTuple tup) +{ + /* + * Currently, the only constraints implemented for system catalogs are + * attnotnull constraints. + */ + if (HeapTupleHasNulls(tup)) + { + TupleDesc tupdesc = RelationGetDescr(heapRel); + bits8 *bp = tup->t_data->t_bits; + + for (int attnum = 0; attnum < tupdesc->natts; attnum++) + { + Form_pg_attribute thisatt = TupleDescAttr(tupdesc, attnum); + + Assert(!(thisatt->attnotnull && att_isnull(attnum, bp))); + } + } +} + +#else /* !USE_ASSERT_CHECKING */ + +#define CatalogTupleCheckConstraints(heapRel, tup) ((void) 0) + +#endif /* USE_ASSERT_CHECKING */ + /* * CatalogTupleInsert - do heap and indexing work for a new catalog tuple * @@ -184,6 +221,8 @@ CatalogTupleInsert(Relation heapRel, HeapTuple tup) { CatalogIndexState indstate; + CatalogTupleCheckConstraints(heapRel, tup); + indstate = CatalogOpenIndexes(heapRel); simple_heap_insert(heapRel, tup); @@ -204,6 +243,8 @@ void CatalogTupleInsertWithInfo(Relation heapRel, HeapTuple tup, CatalogIndexState indstate) { + CatalogTupleCheckConstraints(heapRel, tup); + simple_heap_insert(heapRel, tup); CatalogIndexInsert(indstate, tup); @@ -225,6 +266,8 @@ CatalogTupleUpdate(Relation heapRel, ItemPointer otid, HeapTuple tup) { CatalogIndexState indstate; + CatalogTupleCheckConstraints(heapRel, tup); + indstate = CatalogOpenIndexes(heapRel); simple_heap_update(heapRel, otid, tup); @@ -245,6 +288,8 @@ void CatalogTupleUpdateWithInfo(Relation heapRel, ItemPointer otid, HeapTuple tup, CatalogIndexState indstate) { + CatalogTupleCheckConstraints(heapRel, tup); + simple_heap_update(heapRel, otid, tup); CatalogIndexInsert(indstate, tup);
I wrote: > pg_subscription_rel.srsublsn also needs to be marked nullable. Not only is it wrongly marked attnotnull, but two of the three places that read it are doing so unsafely (ie, as though it *were* non-nullable). So I think we'd better fix it as attached. regards, tom lane diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 80c183b235..13c871d2a8 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -7631,7 +7631,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l <structfield>srsublsn</structfield> <type>pg_lsn</type> </para> <para> - End LSN for <literal>s</literal> and <literal>r</literal> states. + Remote LSN of the state change used for synchronization coordination + when in <literal>s</literal> or <literal>r</literal> states, + otherwise null </para></entry> </row> </tbody> diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index e6afb3203e..90bf5cf0c6 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -452,13 +452,20 @@ GetSubscriptionRelations(Oid subid) { Form_pg_subscription_rel subrel; SubscriptionRelState *relstate; + Datum d; + bool isnull; subrel = (Form_pg_subscription_rel) GETSTRUCT(tup); relstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState)); relstate->relid = subrel->srrelid; relstate->state = subrel->srsubstate; - relstate->lsn = subrel->srsublsn; + d = SysCacheGetAttr(SUBSCRIPTIONRELMAP, tup, + Anum_pg_subscription_rel_srsublsn, &isnull); + if (isnull) + relstate->lsn = InvalidXLogRecPtr; + else + relstate->lsn = DatumGetLSN(d); res = lappend(res, relstate); } @@ -504,13 +511,20 @@ GetSubscriptionNotReadyRelations(Oid subid) { Form_pg_subscription_rel subrel; SubscriptionRelState *relstate; + Datum d; + bool isnull; subrel = (Form_pg_subscription_rel) GETSTRUCT(tup); relstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState)); relstate->relid = subrel->srrelid; relstate->state = subrel->srsubstate; - relstate->lsn = subrel->srsublsn; + d = SysCacheGetAttr(SUBSCRIPTIONRELMAP, tup, + Anum_pg_subscription_rel_srsublsn, &isnull); + if (isnull) + relstate->lsn = InvalidXLogRecPtr; + else + relstate->lsn = DatumGetLSN(d); res = lappend(res, relstate); } diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h index c44df04c72..f384f4e7fa 100644 --- a/src/include/catalog/pg_subscription_rel.h +++ b/src/include/catalog/pg_subscription_rel.h @@ -33,8 +33,18 @@ CATALOG(pg_subscription_rel,6102,SubscriptionRelRelationId) Oid srsubid; /* Oid of subscription */ Oid srrelid; /* Oid of relation */ char srsubstate; /* state of the relation in subscription */ - XLogRecPtr srsublsn; /* remote lsn of the state change used for - * synchronization coordination */ + + /* + * Although srsublsn is a fixed-width type, it is allowed to be NULL, so + * we prevent direct C code access to it just as for a varlena field. + */ +#ifdef CATALOG_VARLEN /* variable-length fields start here */ + + XLogRecPtr srsublsn BKI_FORCE_NULL; /* remote LSN of the state change + * used for synchronization + * coordination, or NULL if not + * valid */ +#endif } FormData_pg_subscription_rel; typedef FormData_pg_subscription_rel *Form_pg_subscription_rel;
Mopping this up ... the attached patch against v12 shows the portions of 72eab84a5 and 0fa0b487b that I'm thinking of putting into v10-v12. The doc changes, which just clarify that subslotname and srsublsn can be null, should be uncontroversial. The changes in pg_subscription.c prevent it from accessing data that might not be there. 99.999% of the time, that doesn't matter; we'd copy garbage into SubscriptionRelState.lsn, but the callers shouldn't look at that field in states where it's not valid. However, it's possible that the code could access data off the end of the heap page, and at least in theory that could lead to a SIGSEGV. What I'm not quite sure about is whether to add the BKI_FORCE_NULL annotations to the headers or not. There are some pros and cons: * While Catalog.pm has had support for BKI_FORCE_NULL for quite some time, we never used it in anger before yesterday. It's easy to check that it works, but I wonder whether anybody has third-party analysis tools that look at the catalog headers and would get broken because they didn't cover this. * If we change these markings, then our own testing in the buildfarm etc. will not reflect the state of affairs seen in many/most actual v10-v12 installations. The scope of code where it'd matter seems pretty tiny, so I don't think there's a big risk, but there's more than zero risk. (In any case, I would not push this part until all the buildfarm JIT critters have reported happiness with 798b4faef, as that's the one specific spot where it surely does matter.) * On the other side of the ledger, if we don't fix these markings we cannot back-patch the additional assertions I proposed at [1]. I'm kind of leaning to committing this as shown and back-patching the patch at [1], but certainly a case could be made in the other direction. Thoughts? regards, tom lane [1] https://www.postgresql.org/message-id/298837.1595196283%40sss.pgh.pa.us diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 76ed2fbea8..582f6f66c6 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -6786,8 +6786,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l <entry><structfield>subslotname</structfield></entry> <entry><type>name</type></entry> <entry></entry> - <entry>Name of the replication slot in the upstream database. Also used - for local replication origin name.</entry> + <entry>Name of the replication slot in the upstream database (also used + for the local replication origin name); + null represents <literal>NONE</literal></entry> </row> <row> @@ -6869,7 +6870,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l <entry><type>pg_lsn</type></entry> <entry></entry> <entry> - End LSN for <literal>s</literal> and <literal>r</literal> states. + Remote LSN of the state change used for synchronization coordination + when in <literal>s</literal> or <literal>r</literal> states, + otherwise null </entry> </row> </tbody> diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index afee2838cc..a3a8b4e599 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -457,13 +457,20 @@ GetSubscriptionRelations(Oid subid) { Form_pg_subscription_rel subrel; SubscriptionRelState *relstate; + Datum d; + bool isnull; subrel = (Form_pg_subscription_rel) GETSTRUCT(tup); relstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState)); relstate->relid = subrel->srrelid; relstate->state = subrel->srsubstate; - relstate->lsn = subrel->srsublsn; + d = SysCacheGetAttr(SUBSCRIPTIONRELMAP, tup, + Anum_pg_subscription_rel_srsublsn, &isnull); + if (isnull) + relstate->lsn = InvalidXLogRecPtr; + else + relstate->lsn = DatumGetLSN(d); res = lappend(res, relstate); } @@ -509,13 +516,20 @@ GetSubscriptionNotReadyRelations(Oid subid) { Form_pg_subscription_rel subrel; SubscriptionRelState *relstate; + Datum d; + bool isnull; subrel = (Form_pg_subscription_rel) GETSTRUCT(tup); relstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState)); relstate->relid = subrel->srrelid; relstate->state = subrel->srsubstate; - relstate->lsn = subrel->srsublsn; + d = SysCacheGetAttr(SUBSCRIPTIONRELMAP, tup, + Anum_pg_subscription_rel_srsublsn, &isnull); + if (isnull) + relstate->lsn = InvalidXLogRecPtr; + else + relstate->lsn = DatumGetLSN(d); res = lappend(res, relstate); } diff --git a/src/include/catalog/pg_subscription.h b/src/include/catalog/pg_subscription.h index 3cb13d897e..796c8794ca 100644 --- a/src/include/catalog/pg_subscription.h +++ b/src/include/catalog/pg_subscription.h @@ -53,7 +53,7 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW text subconninfo BKI_FORCE_NOT_NULL; /* Slot name on publisher */ - NameData subslotname; + NameData subslotname BKI_FORCE_NULL; /* Synchronous commit setting for worker */ text subsynccommit BKI_FORCE_NOT_NULL; diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h index f7df814a18..6f5607d604 100644 --- a/src/include/catalog/pg_subscription_rel.h +++ b/src/include/catalog/pg_subscription_rel.h @@ -34,8 +34,18 @@ CATALOG(pg_subscription_rel,6102,SubscriptionRelRelationId) Oid srsubid; /* Oid of subscription */ Oid srrelid; /* Oid of relation */ char srsubstate; /* state of the relation in subscription */ - XLogRecPtr srsublsn; /* remote lsn of the state change used for - * synchronization coordination */ + + /* + * Although srsublsn is a fixed-width type, it is allowed to be NULL, so + * we prevent direct C code access to it just as for a varlena field. + */ +#ifdef CATALOG_VARLEN /* variable-length fields start here */ + + XLogRecPtr srsublsn BKI_FORCE_NULL; /* remote LSN of the state change + * used for synchronization + * coordination, or NULL if not + * valid */ +#endif } FormData_pg_subscription_rel; typedef FormData_pg_subscription_rel *Form_pg_subscription_rel;
I wrote: > * On the other side of the ledger, if we don't fix these markings > we cannot back-patch the additional assertions I proposed at [1]. > I'm kind of leaning to committing this as shown and back-patching > the patch at [1], but certainly a case could be made in the other > direction. Thoughts? After further thought about that I realized that the assertion patch could be kluged in the same way as we did in llvmjit_deform.c, and that that would really be the only safe way to do it pre-v13. Otherwise the assertions would trip in pre-existing databases, which would not be nice. So what I've done is to back-patch the assertions that way, and *not* apply BKI_FORCE_NULL in the back branches. The possible downsides of doing that seem to outweigh the upside of making the catalog state cleaner in new installations. regards, tom lane