Thread: Refactoring SysCacheGetAttr to know when attr cannot be NULL
Today we have two fairly common patterns around extracting an attr from a cached tuple: a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull); Assert(!isnull); a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull); if (isnull) elog(ERROR, ".."); The error message in the elog() cases also vary quite a lot. I've been unable to find much in terms of guidelines for when to use en elog or an Assert, with the likelyhood of a NULL value seemingly being the guiding principle (but not in all cases IIUC). The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around SysCacheGetAttr where a NULL value triggers an elog(). This removes a lot of boilerplate error handling which IMO leads to increased readability as the error handling *in these cases* don't add much (there are other cases where checking isnull does a lot of valuable work of course). Personally I much prefer the error-out automatically style of APIs like how palloc saves a ton of checking the returned allocation for null, this aims at providing a similar abstraction. This will reduce granularity of error messages, and as the patch sits now it does so a lot since the message is left to work on - I wanted to see if this was at all seen as a net positive before spending time on that part. I chose an elog since I as a user would prefer to hit an elog instead of a silent keep going with an assert, this is of course debateable. Thoughts? -- Daniel Gustafsson
Attachment
Daniel Gustafsson <daniel@yesql.se> writes: > The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around > SysCacheGetAttr where a NULL value triggers an elog(). +1, seems like a good idea. (I didn't review the patch details.) > This will reduce granularity of error messages, and as the patch sits now it > does so a lot since the message is left to work on - I wanted to see if this > was at all seen as a net positive before spending time on that part. I chose > an elog since I as a user would prefer to hit an elog instead of a silent keep > going with an assert, this is of course debateable. I'd venture that the Assert cases are mostly from laziness, and that once we centralize this it's plenty worthwhile to generate a decent elog message. You ought to be able to look up the table and column name from the info that is at hand. Also ... at least in assert-enabled builds, maybe we could check that the column being fetched this way is actually marked attnotnull? That would help to catch misuse. regards, tom lane
On 28.02.23 21:14, Daniel Gustafsson wrote: > Today we have two fairly common patterns around extracting an attr from a > cached tuple: > > a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull); > Assert(!isnull); > > a = SysCacheGetAttr(OID, tuple, Anum_pg_foo_bar, &isnull); > if (isnull) > elog(ERROR, ".."); > The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around > SysCacheGetAttr where a NULL value triggers an elog(). This removes a lot of > boilerplate error handling which IMO leads to increased readability as the > error handling *in these cases* don't add much (there are other cases where > checking isnull does a lot of valuable work of course). Personally I much > prefer the error-out automatically style of APIs like how palloc saves a ton of > checking the returned allocation for null, this aims at providing a similar > abstraction. Yes please! I have occasionally wondered whether just passing the isnull argument as NULL would be sufficient, so we don't need a new function.
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > Yes please! > I have occasionally wondered whether just passing the isnull argument as > NULL would be sufficient, so we don't need a new function. I thought about that too. I think I prefer Daniel's formulation with the new function, but I'm not especially set on that. An advantage of using a new function name is it'd be more obvious what's wrong if you try to back-patch such code into a branch that lacks the feature. (Or, of course, we could back-patch the feature.) regards, tom lane
> On 1 Mar 2023, at 21:04, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: >> Yes please! > >> I have occasionally wondered whether just passing the isnull argument as >> NULL would be sufficient, so we don't need a new function. > > I thought about that too. I think I prefer Daniel's formulation > with the new function, but I'm not especially set on that. I prefer the new function since the name makes the code self documented rather than developers not used to the API having to look up what the last NULL actually means. -- Daniel Gustafsson
On 28.02.23 21:14, Daniel Gustafsson wrote: > The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around > SysCacheGetAttr where a NULL value triggers an elog(). This removes a lot of > boilerplate error handling which IMO leads to increased readability as the > error handling *in these cases* don't add much (there are other cases where > checking isnull does a lot of valuable work of course). Personally I much > prefer the error-out automatically style of APIs like how palloc saves a ton of > checking the returned allocation for null, this aims at providing a similar > abstraction. I looked through the patch. The changes look ok to me. In some cases, more line breaks could be removed (that is, the whole call could be put on one line now). > This will reduce granularity of error messages, and as the patch sits now it > does so a lot since the message is left to work on - I wanted to see if this > was at all seen as a net positive before spending time on that part. I chose > an elog since I as a user would prefer to hit an elog instead of a silent keep > going with an assert, this is of course debateable. I think an error message like "unexpected null value in system cache %d column %d" is sufficient. Since these are "can't happen" errors, we don't need to spend too much extra effort to make it prettier. I don't think the unlikely() is going to buy much. If you are worried on that level, SysCacheGetAttrNotNull() ought to be made inline. Looking through the sites of the changes, I didn't find any callers where I'd be worried on that level.
> On 1 Mar 2023, at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also ... at least in assert-enabled builds, maybe we could check that > the column being fetched this way is actually marked attnotnull? > That would help to catch misuse. We could, but that would limit the API to attnotnull columns, rather than when the caller knows that the attr cannot be NULL either due to attnotnull or due to intrinsic knowledge based on what is being extracted. An example of the latter is build_function_result_tupdesc_t() which knows that proallargtypes cannot be NULL when calling SysCacheGetAttr. I think I prefer to allow those cases rather than the strict mode where attnotnull has to be true, do you think it's preferrable to align the API with attnotnull and keep the current coding for cases like the above? -- Daniel Gustafsson
> On 2 Mar 2023, at 10:59, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 28.02.23 21:14, Daniel Gustafsson wrote: >> The attached refactoring introduce SysCacheGetAttrNotNull as a wrapper around >> SysCacheGetAttr where a NULL value triggers an elog(). This removes a lot of >> boilerplate error handling which IMO leads to increased readability as the >> error handling *in these cases* don't add much (there are other cases where >> checking isnull does a lot of valuable work of course). Personally I much >> prefer the error-out automatically style of APIs like how palloc saves a ton of >> checking the returned allocation for null, this aims at providing a similar >> abstraction. > > I looked through the patch. Thanks! > The changes look ok to me. In some cases, more line breaks could be removed (that is, the whole call could be put on oneline now). I've tried to find those that would fit on a single line in the attached v2. >> This will reduce granularity of error messages, and as the patch sits now it >> does so a lot since the message is left to work on - I wanted to see if this >> was at all seen as a net positive before spending time on that part. I chose >> an elog since I as a user would prefer to hit an elog instead of a silent keep >> going with an assert, this is of course debateable. > > I think an error message like > > "unexpected null value in system cache %d column %d" > > is sufficient. Since these are "can't happen" errors, we don't need to spend too much extra effort to make it prettier. They really should never happen, but since we have all the information we need it seems reasonable to ease debugging. I've made a slightly extended elog in the attached patch. Callsites which had a detailed errormessage have been left passing isnull, like for example statext_expressions_load(). > I don't think the unlikely() is going to buy much. If you are worried on that level, SysCacheGetAttrNotNull() ought tobe made inline. Looking through the sites of the changes, I didn't find any callers where I'd be worried on that level. Fair enough, removed. -- Daniel Gustafsson
Attachment
Daniel Gustafsson <daniel@yesql.se> writes: >> On 1 Mar 2023, at 00:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also ... at least in assert-enabled builds, maybe we could check that >> the column being fetched this way is actually marked attnotnull? > We could, but that would limit the API to attnotnull columns, rather than when > the caller knows that the attr cannot be NULL either due to attnotnull or due > to intrinsic knowledge based on what is being extracted. > An example of the latter is build_function_result_tupdesc_t() which knows that > proallargtypes cannot be NULL when calling SysCacheGetAttr. OK, if there are counterexamples then never mind that. I don't think we want to discourage call sites from using this function. regards, tom lane
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > I think an error message like > "unexpected null value in system cache %d column %d" > is sufficient. Since these are "can't happen" errors, we don't need to > spend too much extra effort to make it prettier. I'd at least like to see it give the catalog's OID. That's easily convertible to a name, and it doesn't tend to move around across PG versions, neither of which are true for syscache IDs. Also, I'm fairly unconvinced that it's a "can't happen" --- this would be very likely to fire as a result of catalog corruption, so it would be good if it's at least minimally interpretable by a non-expert. Given that we'll now have just one copy of the code, ISTM there's a good case for doing the small extra work to report catalog and column by name. regards, tom lane
> On 2 Mar 2023, at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: >> I think an error message like >> "unexpected null value in system cache %d column %d" >> is sufficient. Since these are "can't happen" errors, we don't need to >> spend too much extra effort to make it prettier. > > I'd at least like to see it give the catalog's OID. That's easily > convertible to a name, and it doesn't tend to move around across PG > versions, neither of which are true for syscache IDs. > > Also, I'm fairly unconvinced that it's a "can't happen" --- this > would be very likely to fire as a result of catalog corruption, > so it would be good if it's at least minimally interpretable > by a non-expert. Given that we'll now have just one copy of the > code, ISTM there's a good case for doing the small extra work > to report catalog and column by name. Rebased v3 on top of recent conflicting ICU changes causing the patch to not apply anymore. Also took another look around the tree to see if there were missed callsites but found none new. -- Daniel Gustafsson
Attachment
On 13.03.23 14:19, Daniel Gustafsson wrote: >> On 2 Mar 2023, at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: >>> I think an error message like >>> "unexpected null value in system cache %d column %d" >>> is sufficient. Since these are "can't happen" errors, we don't need to >>> spend too much extra effort to make it prettier. >> >> I'd at least like to see it give the catalog's OID. That's easily >> convertible to a name, and it doesn't tend to move around across PG >> versions, neither of which are true for syscache IDs. >> >> Also, I'm fairly unconvinced that it's a "can't happen" --- this >> would be very likely to fire as a result of catalog corruption, >> so it would be good if it's at least minimally interpretable >> by a non-expert. Given that we'll now have just one copy of the >> code, ISTM there's a good case for doing the small extra work >> to report catalog and column by name. > > Rebased v3 on top of recent conflicting ICU changes causing the patch to not > apply anymore. Also took another look around the tree to see if there were > missed callsites but found none new. I think the only open question here was the granularity of the error message, which I think you had addressed in v2. + if (isnull) + { + elog(ERROR, + "unexpected NULL value in cached tuple for pg_catalog.%s.%s", + get_rel_name(cacheinfo[cacheId].reloid), + NameStr(TupleDescAttr(SysCache[cacheId]->cc_tupdesc, attributeNumber - 1)->attname)); + } I prefer to use "null value" for SQL null values, and NULL for the C symbol. I'm a bit hesitant about hardcoding pg_catalog here. That happens to be true, of course, but isn't actually enforced, I think. I think that could be left off. It's not like people will be confused about which schema "pg_class.relname" is in. Also, the cached tuple isn't really for the attribute, so maybe split that up a bit, like "unexpected null value in cached tuple for catalog %s column %s"
On Tue, 14 Mar 2023 at 02:19, Daniel Gustafsson <daniel@yesql.se> wrote: > Rebased v3 on top of recent conflicting ICU changes causing the patch to not > apply anymore. Also took another look around the tree to see if there were > missed callsites but found none new. I had a look at this. It generally seems like a good change. One thing I thought about while looking is it stage 2 might do something similar for SearchSysCacheN. I then wondered if we're more likely to want to keep the localised __FILE__, __LINE__ and __func__ in the elog for those or not. It's probably less important that we're losing those for this change, but worth noting here at least in case nobody else thought of it. I only noticed in a couple of places you have a few lines at 80 chars before the LF. Ideally those would wrap at 79 so that it's 80 including LF. No big deal though. David
On 23.03.23 09:52, David Rowley wrote: > One thing I thought about while looking is it stage 2 might do > something similar for SearchSysCacheN. I then wondered if we're more > likely to want to keep the localised __FILE__, __LINE__ and __func__ > in the elog for those or not. It's probably less important that we're > losing those for this change, but worth noting here at least in case > nobody else thought of it. I don't follow what you are asking for here.
On Fri, 24 Mar 2023 at 20:31, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 23.03.23 09:52, David Rowley wrote: > > One thing I thought about while looking is it stage 2 might do > > something similar for SearchSysCacheN. I then wondered if we're more > > likely to want to keep the localised __FILE__, __LINE__ and __func__ > > in the elog for those or not. It's probably less important that we're > > losing those for this change, but worth noting here at least in case > > nobody else thought of it. > > I don't follow what you are asking for here. I had two points: 1. Doing something similar for SearchSysCache1 and co might be a good phase two to this change. 2. With the change Daniel is proposing here, \set VERBOSITY verbose is not going to print as useful information to tracking down where any unexpected nulls in the catalogue originates. For #2, I don't think that's necessarily a problem. I can think of two reasons why SysCacheGetAttrNotNull might throw an ERROR: a) We used SysCacheGetAttrNotNull() when we should have used SysCacheGetAttr(). b) Catalogue corruption. A more localised ERROR message might just help more easily tracking down type a) problems. I imagine it won't be too difficult to just grep for all the SysCacheGetAttrNotNull calls for the particular nullable column to find the one causing the issue. For b), the error message in SysCacheGetAttrNotNull is sufficient without needing to know where the SysCacheGetAttrNotNull call came from. David
> On 24 Mar 2023, at 21:59, David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 24 Mar 2023 at 20:31, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: >> On 23.03.23 09:52, David Rowley wrote: >>> One thing I thought about while looking is it stage 2 might do >>> something similar for SearchSysCacheN. I then wondered if we're more >>> likely to want to keep the localised __FILE__, __LINE__ and __func__ >>> in the elog for those or not. It's probably less important that we're >>> losing those for this change, but worth noting here at least in case >>> nobody else thought of it. >> >> I don't follow what you are asking for here. > > I had two points: > > 1. Doing something similar for SearchSysCache1 and co might be a good > phase two to this change. Quite possibly yes, they do follow a pretty repeatable pattern. > 2. With the change Daniel is proposing here, \set VERBOSITY verbose is > not going to print as useful information to tracking down where any > unexpected nulls in the catalogue originates. Thats a fair point for the elog() removals, for the rather many assertions it might be a net positive to get a non-local elog when failing in production. -- Daniel Gustafsson
> On 14 Mar 2023, at 08:00, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > I prefer to use "null value" for SQL null values, and NULL for the C symbol. Thats a fair point, I agree with that. > I'm a bit hesitant about hardcoding pg_catalog here. That happens to be true, of course, but isn't actually enforced,I think. I think that could be left off. It's not like people will be confused about which schema "pg_class.relname"is in. > > Also, the cached tuple isn't really for the attribute, so maybe split that up a bit, like > > "unexpected null value in cached tuple for catalog %s column %s" No objections, so changed to that wording. With these changes and a pgindent run across it per Davids comment downthread, I've pushed this now. Thanks for review! I'm keeping a watchful eye on the buildfarm; francolin has errored in recoveryCheck which I'm looking into but at first glance I don't think it's related (other animals have since passed it and it works locally, but I'll keep digging at it to make sure). -- Daniel Gustafsson
On Mon, Mar 13, 2023 at 02:19:07PM +0100, Daniel Gustafsson wrote: > > On 2 Mar 2023, at 15:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > >> I think an error message like > >> "unexpected null value in system cache %d column %d" > >> is sufficient. Since these are "can't happen" errors, we don't need to > >> spend too much extra effort to make it prettier. > > > > I'd at least like to see it give the catalog's OID. That's easily > > convertible to a name, and it doesn't tend to move around across PG > > versions, neither of which are true for syscache IDs. > > > > Also, I'm fairly unconvinced that it's a "can't happen" --- this > > would be very likely to fire as a result of catalog corruption, > > so it would be good if it's at least minimally interpretable > > by a non-expert. Given that we'll now have just one copy of the > > code, ISTM there's a good case for doing the small extra work > > to report catalog and column by name. > > Rebased v3 on top of recent conflicting ICU changes causing the patch to not > apply anymore. Also took another look around the tree to see if there were > missed callsites but found none new. +++ b/src/backend/utils/cache/syscache.c @@ -77,6 +77,7 @@ #include "catalog/pg_user_mapping.h" #include "lib/qunique.h" #include "utils/catcache.h" +#include "utils/lsyscache.h" #include "utils/rel.h" #include "utils/syscache.h" @@ -1099,6 +1100,32 @@ SysCacheGetAttr(int cacheId, HeapTuple tup, + elog(ERROR, + "unexpected NULL value in cached tuple for pg_catalog.%s.%s", + get_rel_name(cacheinfo[cacheId].reloid), Question: Is it safe to be making catalog access inside an error handler, when one of the most likely reason for hitting the error is catalog corruption ? Maybe the answer is that it's not "safe" but "safe enough" - IOW, if you're willing to throw an assertion, it's good enough to try to show the table name, and if the error report crashes the server, that's "not much worse" than having Asserted(). -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > On Mon, Mar 13, 2023 at 02:19:07PM +0100, Daniel Gustafsson wrote: >> + elog(ERROR, >> + "unexpected NULL value in cached tuple for pg_catalog.%s.%s", >> + get_rel_name(cacheinfo[cacheId].reloid), > Question: Is it safe to be making catalog access inside an error > handler, when one of the most likely reason for hitting the error is > catalog corruption ? I don't see a big problem here. If there were catalog corruption preventing fetching the catalog's pg_class row, it's highly unlikely that you'd have managed to retrieve a catalog row to complain about. (That is, corruption in this particular catalog entry probably does not extend to the metadata about the catalog containing it.) > Maybe the answer is that it's not "safe" but "safe enough" Right. If we got concerned about this we could dodge the extra catalog access by adding the catalog's name to CatCache entries. I doubt it's worth it though. We can always re-evaluate if we see actual evidence of problems. regards, tom lane