Thread: Refactoring SysCacheGetAttr to know when attr cannot be NULL

Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Daniel Gustafsson
Date:
Today we have two fairly common patterns around extracting an attr from a
cached tuple:

  a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull);
  Assert(!isnull);

  a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull);
  if (isnull)
    elog(ERROR, "..");

The error message in the elog() cases also vary quite a lot.  I've been unable
to find much in terms of guidelines for when to use en elog or an Assert, with
the likelyhood of a NULL value seemingly being the guiding principle (but not
in all cases IIUC).

The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around
SysCacheGetAttr where a NULL value triggers an elog().  This removes a lot of
boilerplate error handling which IMO leads to increased readability as the
error handling *in these cases* don't add much (there are other cases where
checking isnull does a lot of valuable work of course).  Personally I much
prefer the error-out automatically style of APIs like how palloc saves a ton of
checking the returned allocation for null, this aims at providing a similar
abstraction.

This will reduce granularity of error messages, and as the patch sits now it
does so a lot since the message is left to work on - I wanted to see if this
was at all seen as a net positive before spending time on that part.  I chose
an elog since I as a user would prefer to hit an elog instead of a silent keep
going with an assert, this is of course debateable.

Thoughts?

--
Daniel Gustafsson


Attachment

Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Tom Lane
Date:
Daniel Gustafsson <daniel@yesql.se> writes:
> The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around
> SysCacheGetAttr where a NULL value triggers an elog().

+1, seems like a good idea.  (I didn't review the patch details.)

> This will reduce granularity of error messages, and as the patch sits now it
> does so a lot since the message is left to work on - I wanted to see if this
> was at all seen as a net positive before spending time on that part.  I chose
> an elog since I as a user would prefer to hit an elog instead of a silent keep
> going with an assert, this is of course debateable.

I'd venture that the Assert cases are mostly from laziness, and
that once we centralize this it's plenty worthwhile to generate
a decent elog message.  You ought to be able to look up the
table and column name from the info that is at hand.

Also ... at least in assert-enabled builds, maybe we could check that
the column being fetched this way is actually marked attnotnull?
That would help to catch misuse.

            regards, tom lane



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Peter Eisentraut
Date:
On 28.02.23 21:14, Daniel Gustafsson wrote:
> Today we have two fairly common patterns around extracting an attr from a
> cached tuple:
> 
>    a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull);
>    Assert(!isnull);
> 
>    a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull);
>    if (isnull)
>      elog(ERROR, "..");

> The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around
> SysCacheGetAttr where a NULL value triggers an elog().  This removes a lot of
> boilerplate error handling which IMO leads to increased readability as the
> error handling *in these cases* don't add much (there are other cases where
> checking isnull does a lot of valuable work of course).  Personally I much
> prefer the error-out automatically style of APIs like how palloc saves a ton of
> checking the returned allocation for null, this aims at providing a similar
> abstraction.

Yes please!

I have occasionally wondered whether just passing the isnull argument as 
NULL would be sufficient, so we don't need a new function.




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> Yes please!

> I have occasionally wondered whether just passing the isnull argument as 
> NULL would be sufficient, so we don't need a new function.

I thought about that too.  I think I prefer Daniel's formulation
with the new function, but I'm not especially set on that.

An advantage of using a new function name is it'd be more obvious
what's wrong if you try to back-patch such code into a branch that
lacks the feature.  (Or, of course, we could back-patch the feature.)

            regards, tom lane



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Daniel Gustafsson
Date:
> On 1 Mar 2023, at 21:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
>> Yes please!
>
>> I have occasionally wondered whether just passing the isnull argument as
>> NULL would be sufficient, so we don't need a new function.
>
> I thought about that too.  I think I prefer Daniel's formulation
> with the new function, but I'm not especially set on that.

I prefer the new function since the name makes the code self documented rather
than developers not used to the API having to look up what the last NULL
actually means.

--
Daniel Gustafsson




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Peter Eisentraut
Date:
On 28.02.23 21:14, Daniel Gustafsson wrote:
> The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around
> SysCacheGetAttr where a NULL value triggers an elog().  This removes a lot of
> boilerplate error handling which IMO leads to increased readability as the
> error handling *in these cases* don't add much (there are other cases where
> checking isnull does a lot of valuable work of course).  Personally I much
> prefer the error-out automatically style of APIs like how palloc saves a ton of
> checking the returned allocation for null, this aims at providing a similar
> abstraction.

I looked through the patch.  The changes look ok to me.  In some cases, 
more line breaks could be removed (that is, the whole call could be put 
on one line now).

> This will reduce granularity of error messages, and as the patch sits now it
> does so a lot since the message is left to work on - I wanted to see if this
> was at all seen as a net positive before spending time on that part.  I chose
> an elog since I as a user would prefer to hit an elog instead of a silent keep
> going with an assert, this is of course debateable.

I think an error message like

     "unexpected null value in system cache %d column %d"

is sufficient.  Since these are "can't happen" errors, we don't need to 
spend too much extra effort to make it prettier.

I don't think the unlikely() is going to buy much.  If you are worried 
on that level, SysCacheGetAttrNotNull() ought to be made inline. 
Looking through the sites of the changes, I didn't find any callers 
where I'd be worried on that level.




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Daniel Gustafsson
Date:
> On 1 Mar 2023, at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Also ... at least in assert-enabled builds, maybe we could check that
> the column being fetched this way is actually marked attnotnull?
> That would help to catch misuse.

We could, but that would limit the API to attnotnull columns, rather than when
the caller knows that the attr cannot be NULL either due to attnotnull or due
to intrinsic knowledge based on what is being extracted.

An example of the latter is build_function_result_tupdesc_t() which knows that
proallargtypes cannot be NULL when calling SysCacheGetAttr.

I think I prefer to allow those cases rather than the strict mode where
attnotnull has to be true, do you think it's preferrable to align the API with
attnotnull and keep the current coding for cases like the above?

--
Daniel Gustafsson




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Daniel Gustafsson
Date:
> On 2 Mar 2023, at 10:59, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 28.02.23 21:14, Daniel Gustafsson wrote:
>> The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around
>> SysCacheGetAttr where a NULL value triggers an elog().  This removes a lot of
>> boilerplate error handling which IMO leads to increased readability as the
>> error handling *in these cases* don't add much (there are other cases where
>> checking isnull does a lot of valuable work of course).  Personally I much
>> prefer the error-out automatically style of APIs like how palloc saves a ton of
>> checking the returned allocation for null, this aims at providing a similar
>> abstraction.
>
> I looked through the patch.

Thanks!

> The changes look ok to me.  In some cases, more line breaks could be removed (that is, the whole call could be put on
oneline now). 

I've tried to find those that would fit on a single line in the attached v2.

>> This will reduce granularity of error messages, and as the patch sits now it
>> does so a lot since the message is left to work on - I wanted to see if this
>> was at all seen as a net positive before spending time on that part.  I chose
>> an elog since I as a user would prefer to hit an elog instead of a silent keep
>> going with an assert, this is of course debateable.
>
> I think an error message like
>
>    "unexpected null value in system cache %d column %d"
>
> is sufficient.  Since these are "can't happen" errors, we don't need to spend too much extra effort to make it
prettier.

They really should never happen, but since we have all the information we need
it seems reasonable to ease debugging.  I've made a slightly extended elog in
the attached patch.

Callsites which had a detailed errormessage have been left passing isnull, like
for example statext_expressions_load().

> I don't think the unlikely() is going to buy much.  If you are worried on that level, SysCacheGetAttrNotNull() ought
tobe made inline. Looking through the sites of the changes, I didn't find any callers where I'd be worried on that
level.

Fair enough, removed.

--
Daniel Gustafsson


Attachment

Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Tom Lane
Date:
Daniel Gustafsson <daniel@yesql.se> writes:
>> On 1 Mar 2023, at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also ... at least in assert-enabled builds, maybe we could check that
>> the column being fetched this way is actually marked attnotnull?

> We could, but that would limit the API to attnotnull columns, rather than when
> the caller knows that the attr cannot be NULL either due to attnotnull or due
> to intrinsic knowledge based on what is being extracted.
> An example of the latter is build_function_result_tupdesc_t() which knows that
> proallargtypes cannot be NULL when calling SysCacheGetAttr.

OK, if there are counterexamples then never mind that.  I don't think
we want to discourage call sites from using this function.

            regards, tom lane



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> I think an error message like
>      "unexpected null value in system cache %d column %d"
> is sufficient.  Since these are "can't happen" errors, we don't need to 
> spend too much extra effort to make it prettier.

I'd at least like to see it give the catalog's OID.  That's easily
convertible to a name, and it doesn't tend to move around across PG
versions, neither of which are true for syscache IDs.

Also, I'm fairly unconvinced that it's a "can't happen" --- this
would be very likely to fire as a result of catalog corruption,
so it would be good if it's at least minimally interpretable
by a non-expert.  Given that we'll now have just one copy of the
code, ISTM there's a good case for doing the small extra work
to report catalog and column by name.

            regards, tom lane



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Daniel Gustafsson
Date:
> On 2 Mar 2023, at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
>> I think an error message like
>>     "unexpected null value in system cache %d column %d"
>> is sufficient.  Since these are "can't happen" errors, we don't need to 
>> spend too much extra effort to make it prettier.
> 
> I'd at least like to see it give the catalog's OID.  That's easily
> convertible to a name, and it doesn't tend to move around across PG
> versions, neither of which are true for syscache IDs.
> 
> Also, I'm fairly unconvinced that it's a "can't happen" --- this
> would be very likely to fire as a result of catalog corruption,
> so it would be good if it's at least minimally interpretable
> by a non-expert.  Given that we'll now have just one copy of the
> code, ISTM there's a good case for doing the small extra work
> to report catalog and column by name.

Rebased v3 on top of recent conflicting ICU changes causing the patch to not
apply anymore.  Also took another look around the tree to see if there were
missed callsites but found none new.

--
Daniel Gustafsson


Attachment

Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Peter Eisentraut
Date:
On 13.03.23 14:19, Daniel Gustafsson wrote:
>> On 2 Mar 2023, at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
>>> I think an error message like
>>>      "unexpected null value in system cache %d column %d"
>>> is sufficient.  Since these are "can't happen" errors, we don't need to
>>> spend too much extra effort to make it prettier.
>>
>> I'd at least like to see it give the catalog's OID.  That's easily
>> convertible to a name, and it doesn't tend to move around across PG
>> versions, neither of which are true for syscache IDs.
>>
>> Also, I'm fairly unconvinced that it's a "can't happen" --- this
>> would be very likely to fire as a result of catalog corruption,
>> so it would be good if it's at least minimally interpretable
>> by a non-expert.  Given that we'll now have just one copy of the
>> code, ISTM there's a good case for doing the small extra work
>> to report catalog and column by name.
> 
> Rebased v3 on top of recent conflicting ICU changes causing the patch to not
> apply anymore.  Also took another look around the tree to see if there were
> missed callsites but found none new.

I think the only open question here was the granularity of the error 
message, which I think you had addressed in v2.

+    if (isnull)
+    {
+        elog(ERROR,
+             "unexpected NULL value in cached tuple for pg_catalog.%s.%s",
+             get_rel_name(cacheinfo[cacheId].reloid),
+             NameStr(TupleDescAttr(SysCache[cacheId]->cc_tupdesc, 
attributeNumber - 1)->attname));
+    }

I prefer to use "null value" for SQL null values, and NULL for the C symbol.

I'm a bit hesitant about hardcoding pg_catalog here.  That happens to be 
true, of course, but isn't actually enforced, I think.  I think that 
could be left off.  It's not like people will be confused about which 
schema "pg_class.relname" is in.

Also, the cached tuple isn't really for the attribute, so maybe split 
that up a bit, like

"unexpected null value in cached tuple for catalog %s column %s"



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
David Rowley
Date:
On Tue, 14 Mar 2023 at 02:19, Daniel Gustafsson <daniel@yesql.se> wrote:
> Rebased v3 on top of recent conflicting ICU changes causing the patch to not
> apply anymore.  Also took another look around the tree to see if there were
> missed callsites but found none new.

I had a look at this. It generally seems like a good change.

One thing I thought about while looking is it stage 2 might do
something similar for SearchSysCacheN.  I then wondered if we're more
likely to want to keep the localised __FILE__, __LINE__ and __func__
in the elog for those or not.  It's probably less important that we're
losing those for this change, but worth noting here at least in case
nobody else thought of it.

I only noticed in a couple of places you have a few lines at 80 chars
before the LF. Ideally those would wrap at 79 so that it's 80
including LF. No big deal though.

David



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Peter Eisentraut
Date:
On 23.03.23 09:52, David Rowley wrote:
> One thing I thought about while looking is it stage 2 might do
> something similar for SearchSysCacheN.  I then wondered if we're more
> likely to want to keep the localised __FILE__, __LINE__ and __func__
> in the elog for those or not.  It's probably less important that we're
> losing those for this change, but worth noting here at least in case
> nobody else thought of it.

I don't follow what you are asking for here.



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
David Rowley
Date:
On Fri, 24 Mar 2023 at 20:31, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 23.03.23 09:52, David Rowley wrote:
> > One thing I thought about while looking is it stage 2 might do
> > something similar for SearchSysCacheN.  I then wondered if we're more
> > likely to want to keep the localised __FILE__, __LINE__ and __func__
> > in the elog for those or not.  It's probably less important that we're
> > losing those for this change, but worth noting here at least in case
> > nobody else thought of it.
>
> I don't follow what you are asking for here.

I had two points:

1. Doing something similar for SearchSysCache1 and co might be a good
phase two to this change.
2. With the change Daniel is proposing here, \set VERBOSITY verbose is
not going to print as useful information to tracking down where any
unexpected nulls in the catalogue originates.

For #2, I don't think that's necessarily a problem. I can think of two
reasons why SysCacheGetAttrNotNull might throw an ERROR:

a) We used SysCacheGetAttrNotNull() when we should have used SysCacheGetAttr().
b) Catalogue corruption.

A more localised ERROR message might just help more easily tracking
down type a) problems. I imagine it won't be too difficult to just
grep for all the SysCacheGetAttrNotNull calls for the particular
nullable column to find the one causing the issue. For b), the error
message in SysCacheGetAttrNotNull is sufficient without needing to
know where the SysCacheGetAttrNotNull call came from.

David



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Daniel Gustafsson
Date:
> On 24 Mar 2023, at 21:59, David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 24 Mar 2023 at 20:31, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

>> On 23.03.23 09:52, David Rowley wrote:
>>> One thing I thought about while looking is it stage 2 might do
>>> something similar for SearchSysCacheN.  I then wondered if we're more
>>> likely to want to keep the localised __FILE__, __LINE__ and __func__
>>> in the elog for those or not.  It's probably less important that we're
>>> losing those for this change, but worth noting here at least in case
>>> nobody else thought of it.
>>
>> I don't follow what you are asking for here.
>
> I had two points:
>
> 1. Doing something similar for SearchSysCache1 and co might be a good
> phase two to this change.

Quite possibly yes, they do follow a pretty repeatable pattern.

> 2. With the change Daniel is proposing here, \set VERBOSITY verbose is
> not going to print as useful information to tracking down where any
> unexpected nulls in the catalogue originates.

Thats a fair point for the elog() removals, for the rather many assertions it
might be a net positive to get a non-local elog when failing in production.

--
Daniel Gustafsson




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Daniel Gustafsson
Date:
> On 14 Mar 2023, at 08:00, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

> I prefer to use "null value" for SQL null values, and NULL for the C symbol.

Thats a fair point, I agree with that.

> I'm a bit hesitant about hardcoding pg_catalog here.  That happens to be true, of course, but isn't actually
enforced,I think.  I think that could be left off.  It's not like people will be confused about which schema
"pg_class.relname"is in. 
>
> Also, the cached tuple isn't really for the attribute, so maybe split that up a bit, like
>
> "unexpected null value in cached tuple for catalog %s column %s"

No objections, so changed to that wording.

With these changes and a pgindent run across it per Davids comment downthread,
I've pushed this now.  Thanks for review!

I'm keeping a watchful eye on the buildfarm; francolin has errored in
recoveryCheck which I'm looking into but at first glance I don't think it's
related (other animals have since passed it and it works locally, but I'll keep
digging at it to make sure).

--
Daniel Gustafsson




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Justin Pryzby
Date:
On Mon, Mar 13, 2023 at 02:19:07PM +0100, Daniel Gustafsson wrote:
> > On 2 Mar 2023, at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > 
> > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> >> I think an error message like
> >>     "unexpected null value in system cache %d column %d"
> >> is sufficient.  Since these are "can't happen" errors, we don't need to 
> >> spend too much extra effort to make it prettier.
> > 
> > I'd at least like to see it give the catalog's OID.  That's easily
> > convertible to a name, and it doesn't tend to move around across PG
> > versions, neither of which are true for syscache IDs.
> > 
> > Also, I'm fairly unconvinced that it's a "can't happen" --- this
> > would be very likely to fire as a result of catalog corruption,
> > so it would be good if it's at least minimally interpretable
> > by a non-expert.  Given that we'll now have just one copy of the
> > code, ISTM there's a good case for doing the small extra work
> > to report catalog and column by name.
> 
> Rebased v3 on top of recent conflicting ICU changes causing the patch to not
> apply anymore.  Also took another look around the tree to see if there were
> missed callsites but found none new.

+++ b/src/backend/utils/cache/syscache.c
@@ -77,6 +77,7 @@
 #include "catalog/pg_user_mapping.h"
 #include "lib/qunique.h"
 #include "utils/catcache.h"
+#include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"

@@ -1099,6 +1100,32 @@ SysCacheGetAttr(int cacheId, HeapTuple tup,
+               elog(ERROR,
+                        "unexpected NULL value in cached tuple for pg_catalog.%s.%s",
+                        get_rel_name(cacheinfo[cacheId].reloid),

Question: Is it safe to be making catalog access inside an error
handler, when one of the most likely reason for hitting the error is
catalog corruption ?

Maybe the answer is that it's not "safe" but "safe enough" - IOW, if
you're willing to throw an assertion, it's good enough to try to show
the table name, and if the error report crashes the server, that's "not
much worse" than having Asserted().

-- 
Justin



Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Mon, Mar 13, 2023 at 02:19:07PM +0100, Daniel Gustafsson wrote:
>> +               elog(ERROR,
>> +                        "unexpected NULL value in cached tuple for pg_catalog.%s.%s",
>> +                        get_rel_name(cacheinfo[cacheId].reloid),

> Question: Is it safe to be making catalog access inside an error
> handler, when one of the most likely reason for hitting the error is
> catalog corruption ?

I don't see a big problem here.  If there were catalog corruption
preventing fetching the catalog's pg_class row, it's highly unlikely
that you'd have managed to retrieve a catalog row to complain about.
(That is, corruption in this particular catalog entry probably does
not extend to the metadata about the catalog containing it.)

> Maybe the answer is that it's not "safe" but "safe enough"

Right.

If we got concerned about this we could dodge the extra catalog access
by adding the catalog's name to CatCache entries.  I doubt it's worth
it though.  We can always re-evaluate if we see actual evidence of
problems.

            regards, tom lane