Thread: Internal error codes triggered by tests
Hi all, While analyzing the use of internal error codes in the code base, I've some problems, and that's a mixed bag of: - Incorrect uses, for errors that can be triggered by users with vallid cases. - Expected error cases, wanted by the tests like corruption cases or just to keep some code simpler. Here is a summary of the ones that should be fixed with proper errcodes: 1) be-secure-openssl.c is forgetting an error codes for code comparing the ssl_{min,max}_protocol_version range, which should be a ERRCODE_CONFIG_FILE_ERROR. 2) 010_pg_basebackup.pl includes a test for an unmatching system ID at its bottom, that triggers an internal error as an effect of manifest_report_error(). 3) basebackup.c, with a too long symlink or tar name, where ERRCODE_PROGRAM_LIMIT_EXCEEDED should make sense. 4) pg_walinspect, for an invalid LSN. That would be ERRCODE_INVALID_PARAMETER_VALUE. 5) Some paths of pg_ucol_open() are failing internally, still these refer to parameters that can be set, so I've attached ERRCODE_INVALID_PARAMETER_VALUE to them. 6) PerformWalRecovery(), where recovery ends before target is reached, for a ERRCODE_CONFIG_FILE_ERROR. 7) pg_replication_slot_advance(), missing for the target LSN a ERRCODE_INVALID_PARAMETER_VALUE. 8) Isolation test alter-table-4/s2, able to trigger a "attribute "a" of relation "c1" does not match parent's type". Shouldn't that be a ERRCODE_INVALID_COLUMN_DEFINITION? Then there is a series of issues triggered by the main regression test suite, applying three times (pg_upgrade, make check and 027_stream_regress.pl): 1) MergeConstraintsIntoExisting() under a case of relhassubclass, for ERRCODE_INVALID_OBJECT_DEFINITION. 2) Trigger rename on a partition, see renametrig(), for ERRCODE_FEATURE_NOT_SUPPORTED. 3) "could not find suitable unique index on materialized view", with a plain elog() in matview.c, for ERRCODE_FEATURE_NOT_SUPPORTED 4) "role \"blah\" cannot have explicit members", for ERRCODE_FEATURE_NOT_SUPPORTED. 5) Similar to previous one, AddRoleMems() with "role \"blah\" cannot be a member of any role" 6) icu_validate_locale(), icu_language_tag() and make_icu_collator() for invalid parameter inputs. 7) ATExecAlterConstraint() There are a few fancy cases where we expect an internal error per the state of the tests: 1) amcheck 1-1) bt_index_check_internal() in amcheck, where the code has the idea to open an OID given in input, trigerring an elog(). That does not strike me as a good idea, though that's perhaps acceptable. The error is an "could not open relation with OID". 1-2) 003_check.pl has 12 cases with backtraces expected in the outcome as these check corruption cases. 2) pg_visibility does the same thing, for two tests trigerring a "could not open relation with OID". 3) plpython with 6 cases which are internal, not sure what to do about these. 4) test_resowner has two cases triggered by SQL functions, which are expected to be internal. 5) test_tidstore, with "tuple offset out of range" triggered by a SQL call. 6) injection_points, that are aimed for tests, has six backtraces. 7) currtid_internal().. Perhaps we should try to rip out this stuff, which is specific to ODBC. There are a lot of backtraces here. 8) satisfies_hash_partition() in test hash_part, generating a backtrace for an InvalidOid in the main regression test suite. All these cases are able to trigger backtraces, and while of them are OK to keep as they are, the cases of the first and second lists ought to be fixed, and attached is a patch do close the gap. This reduces the number of internal errors generated by the tests from 85 to 35, with injection points enabled. Note that I've kept the currtid() errcodes in it, though I don't think much of them. The rest looks sensible enough to address. Thoughts? -- Michael
Attachment
I sent a list of user-facing elogs here, a few times. ZDclRM/jaTe66nce@telsasoft.com And if someone had expressed an interest, I might have sent a longer list.
On Mon, Apr 29, 2024 at 08:02:45AM -0500, Justin Pryzby wrote: > I sent a list of user-facing elogs here, a few times. > ZDclRM/jaTe66nce@telsasoft.com > > And if someone had expressed an interest, I might have sent a longer > list. Thanks. I'll take a look at what you have there. Nothing would be committed before v18 opens, though. -- Michael
Attachment
On Tue, Apr 23, 2024 at 12:55 AM Michael Paquier <michael@paquier.xyz> wrote: > Thoughts? The patch as proposed seems fine. Marking RfC. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, May 17, 2024 at 01:41:29PM -0400, Robert Haas wrote: > On Tue, Apr 23, 2024 at 12:55 AM Michael Paquier <michael@paquier.xyz> wrote: >> Thoughts? > > The patch as proposed seems fine. Marking RfC. Thanks. I'll look again at that once v18 opens up for business. -- Michael
Attachment
On Sat, May 18, 2024 at 10:56:43AM +0900, Michael Paquier wrote: > Thanks. I'll look again at that once v18 opens up for business. Looked at that again, and one in tablecmds.c is not needed anymore, and there was a conflict in be-secure-openssl.c. Removed the first one, fixed the second one, then applied the patch after a second look. -- Michael
Attachment
Hello Michael, 04.07.2024 03:51, Michael Paquier wrote: > On Sat, May 18, 2024 at 10:56:43AM +0900, Michael Paquier wrote: >> Thanks. I'll look again at that once v18 opens up for business. > Looked at that again, and one in tablecmds.c is not needed anymore, > and there was a conflict in be-secure-openssl.c. Removed the first > one, fixed the second one, then applied the patch after a second look. Could you please share your thoughts regarding other error cases, which is not triggered by existing tests, but still can be easily reached by users? For example: SELECT satisfies_hash_partition(1, 1, 0, 0); ERROR: XX000: could not open relation with OID 1 LOCATION: relation_open, relation.c:61 or: CREATE TABLE t (b bytea); INSERT INTO t SELECT ''::bytea; CREATE INDEX brinidx ON t USING brin (b bytea_bloom_ops(n_distinct_per_range = -1.0)); ERROR: XX000: the bloom filter is too large (44629 > 8144) LOCATION: bloom_init, brin_bloom.c:344 Should such cases be corrected too? Best regards, Alexander
On Thu, Jul 04, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote: > Could you please share your thoughts regarding other error cases, which is > not triggered by existing tests, but still can be easily reached by users? > > For example: > SELECT satisfies_hash_partition(1, 1, 0, 0); > > ERROR: XX000: could not open relation with OID 1 > LOCATION: relation_open, relation.c:61 > > or: > CREATE TABLE t (b bytea); > INSERT INTO t SELECT ''::bytea; > CREATE INDEX brinidx ON t USING brin > (b bytea_bloom_ops(n_distinct_per_range = -1.0)); > > ERROR: XX000: the bloom filter is too large (44629 > 8144) > LOCATION: bloom_init, brin_bloom.c:344 > > Should such cases be corrected too? This is a case-by-case. satisfies_hash_partition() is undocumented, so doing nothing is fine by me. The second one, though is something taht can be triggered with rather normal DDL sequences. That's more annoying. -- Michael
Attachment
Hello Michael, 05.07.2024 03:57, Michael Paquier wrote: > On Thu, Jul 04, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote: >> Could you please share your thoughts regarding other error cases, which is >> not triggered by existing tests, but still can be easily reached by users? >> >> For example: >> SELECT satisfies_hash_partition(1, 1, 0, 0); >> >> ERROR: XX000: could not open relation with OID 1 >> LOCATION: relation_open, relation.c:61 >> >> or: >> CREATE TABLE t (b bytea); >> INSERT INTO t SELECT ''::bytea; >> CREATE INDEX brinidx ON t USING brin >> (b bytea_bloom_ops(n_distinct_per_range = -1.0)); >> >> ERROR: XX000: the bloom filter is too large (44629 > 8144) >> LOCATION: bloom_init, brin_bloom.c:344 >> >> Should such cases be corrected too? > This is a case-by-case. satisfies_hash_partition() is undocumented, > so doing nothing is fine by me. The second one, though is something > taht can be triggered with rather normal DDL sequences. That's more > annoying. Thank you for the answer! Let me show you other error types for discussion/classification: SELECT pg_describe_object(1::regclass, 0, 0); ERROR: XX000: unsupported object class: 1 LOCATION: getObjectDescription, objectaddress.c:4016 or SELECT pg_identify_object_as_address('1'::regclass, 1, 1); ERROR: XX000: unsupported object class: 1 LOCATION: getObjectTypeDescription, objectaddress.c:4597 -- SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000)) \gexec ERROR: XX000: could not open file "pg_snapshots/-----...---" for reading: File name too long LOCATION: ImportSnapshot, snapmgr.c:1428 -- CREATE OPERATOR === (leftarg = int4, rightarg = int4, procedure = int4eq, commutator = ===, hashes); CREATE TABLE t1 (a int); ANALYZE t1; CREATE TABLE t2 (a int); SELECT * FROM t1, t2 WHERE t1.a === t2.a; ERROR: XX000: could not find hash function for hash operator 16385 LOCATION: ExecHashTableCreate, nodeHash.c:560 -- WITH RECURSIVE oq(x) AS ( WITH iq as (SELECT * FROM oq) SELECT * FROM iq UNION SELECT * from iq ) SELECT * FROM oq; ERROR: XX000: missing recursive reference LOCATION: checkWellFormedRecursion, parse_cte.c:896 (commented as "should not happen", but still...) -- CREATE EXTENSION ltree; SELECT '1' ::ltree @ (repeat('!', 100)) ::ltxtquery; ERROR: XX000: stack too short LOCATION: makepol, ltxtquery_io.c:252 -- There is also a couple of dubious ereport(DEBUG1, (errcode(ERRCODE_INTERNAL_ERROR), ...) calls like: /* * User-defined picksplit failed to create an actual split, ie it put * everything on the same side. Complain but cope. */ ereport(DEBUG1, (errcode(ERRCODE_INTERNAL_ERROR), errmsg("picksplit method for column %d of index \"%s\" failed", attno + 1, RelationGetRelationName(r)), I'm not mentioning errors, that require more analysis and maybe correcting the surrounding logic, not ERRCODE only. Maybe it makes sense to separate the discussion of such errors, which are not triggered by tests/not covered; I'm just not sure how to handle them efficiently. Best regards, Alexander
On Mon, Jul 08, 2024 at 12:00:00PM +0300, Alexander Lakhin wrote: > Let me show you other error types for discussion/classification: > SELECT pg_describe_object(1::regclass, 0, 0); > > ERROR: XX000: unsupported object class: 1 > LOCATION: getObjectDescription, objectaddress.c:4016 > or > SELECT pg_identify_object_as_address('1'::regclass, 1, 1); > > ERROR: XX000: unsupported object class: 1 > LOCATION: getObjectTypeDescription, objectaddress.c:4597 These ones are old enough, indeed. Knowing that they usually come down to be used with scans of pg_shdepend and pg_depend to get some information about the objects involved, I've never come down to see if these were really worth tackling. The cases of dropped/undefined case is much more interesting, because we may give in input object IDs that have been dropped concurrently. Class IDs are constants fixed in time. > -- > SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; > SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000)) > \gexec > ERROR: XX000: could not open file "pg_snapshots/-----...---" for reading: File name too long > LOCATION: ImportSnapshot, snapmgr.c:1428 This one is fun. errcode_for_file_access() does not know about ENAMETOOLONG, as an effect of the errno returned by AllocateFile(). Perhaps it should map to ERRCODE_NAME_TOO_LONG? > CREATE OPERATOR === (leftarg = int4, rightarg = int4, procedure = int4eq, > commutator = ===, hashes); > > CREATE TABLE t1 (a int); > ANALYZE t1; > CREATE TABLE t2 (a int); > > SELECT * FROM t1, t2 WHERE t1.a === t2.a; > > ERROR: XX000: could not find hash function for hash operator 16385 > LOCATION: ExecHashTableCreate, nodeHash.c:560 Hehe. You are telling that this operator supports a hash join, but nope. I am not really convinced that this is worth worrying. > -- > WITH RECURSIVE oq(x) AS ( > WITH iq as (SELECT * FROM oq) > SELECT * FROM iq > UNION > SELECT * from iq > ) > SELECT * FROM oq; > > ERROR: XX000: missing recursive reference > LOCATION: checkWellFormedRecursion, parse_cte.c:896 > (commented as "should not happen", but still...) Hmm. This one feels like a bug, indeed. > -- > CREATE EXTENSION ltree; > SELECT '1' ::ltree @ (repeat('!', 100)) ::ltxtquery; > > ERROR: XX000: stack too short > LOCATION: makepol, ltxtquery_io.c:252 ltree has little maintenance, not sure that's worth worrying. > I'm not mentioning errors, that require more analysis and maybe correcting > the surrounding logic, not ERRCODE only. > > Maybe it makes sense to separate the discussion of such errors, which are > not triggered by tests/not covered; I'm just not sure how to handle them > efficiently. The scope is too broad for some of them like the CTE case, so separate threads to attract the correct review audience makes sense to me. -- Michael
Attachment
> On 10 Jul 2024, at 06:42, Michael Paquier <michael@paquier.xyz> wrote: >> SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; >> SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000)) >> \gexec >> ERROR: XX000: could not open file "pg_snapshots/-----...---" for reading: File name too long >> LOCATION: ImportSnapshot, snapmgr.c:1428 > > This one is fun. errcode_for_file_access() does not know about > ENAMETOOLONG, as an effect of the errno returned by AllocateFile(). > Perhaps it should map to ERRCODE_NAME_TOO_LONG? Mapping this case to ERRCODE_NAME_TOO_LONG seems like a legit improvement, even though the error is likely to be quite rare in production. The rest mentioned upthread seems either not worth the effort or are likely to be bugs warranting proper investigation. -- Daniel Gustafsson
Hello Daniel and Michael, 12.07.2024 23:41, Daniel Gustafsson wrote: >> On 10 Jul 2024, at 06:42, Michael Paquier <michael@paquier.xyz> wrote: > The rest mentioned upthread seems either not worth the effort or are likely to > be bugs warranting proper investigation. > I've filed a bug report about the "WITH RECURSIVE" anomaly: [1], but what I wanted to understand when presenting different error kinds is what definition XX000 errors could have in principle? It seems to me that we can't define them as indicators of unexpected (from the server's POV) conditions, similar to assertion failures (but produced with no asserts enabled), which should be and mostly get fixed. If the next thing to do is to get backtrace_on_internal_error back and that GUC is mainly intended for developers, then maybe having clean (or containing expected backtraces only) regression test logs is a final goal and we should stop here. But if it's expected that that GUC could be helpful for users to analyze such errors in production and thus pay extra attention to them, maybe having XX000 status for presumably unreachable conditions only is desirable... [1] https://www.postgresql.org/message-id/18536-0a342ec07901203e%40postgresql.org Best regards, Alexander
On Sun, Jul 14, 2024 at 07:00:00PM +0300, Alexander Lakhin wrote: > I've filed a bug report about the "WITH RECURSIVE" anomaly: [1], but what > I wanted to understand when presenting different error kinds is what > definition XX000 errors could have in principle? Cool, thanks! I can see that Tom has already committed a fix. I'm going to start a new thread for ERRCODE_NAME_TOO_LONG. It would be confusing to solve the issue in the middle of this thread. > If the next thing to do is to get backtrace_on_internal_error back and > that GUC is mainly intended for developers, then maybe having clean (or > containing expected backtraces only) regression test logs is a final goal > and we should stop here. But if it's expected that that GUC could be > helpful for users to analyze such errors in production and thus pay extra > attention to them, maybe having XX000 status for presumably > unreachable conditions only is desirable... Perhaps. Let's see where it leads if we have this discussion again. Some internal errors cannot be avoided because some tests expect such cases (failures with on-disk file manipulation is one). -- Michael
Attachment
On Fri, Jul 12, 2024 at 10:41:14PM +0200, Daniel Gustafsson wrote: > Mapping this case to ERRCODE_NAME_TOO_LONG seems like a legit improvement, even > though the error is likely to be quite rare in production. Hmm. This is interesting, still it could be confusing as ERRCODE_NAME_TOO_LONG is used only for names, when they are longer than NAMEDATALEN, so in context that's a bit different than a longer file name. How about using a new error code in class 58, say a ERRCODE_FILE_NAME_TOO_LONG like in the attached? ERRCODE_DUPLICATE_FILE is like that; it exists just for the mapping with EEXIST. -- Michael
Attachment
> On 18 Jul 2024, at 09:29, Michael Paquier <michael@paquier.xyz> wrote: > How about using a new error code in class 58, say a > ERRCODE_FILE_NAME_TOO_LONG like in the attached? > ERRCODE_DUPLICATE_FILE is like that; it exists just for the mapping > with EEXIST. Agreed, that's probably a better option. -- Daniel Gustafsson
On Thu, Jul 18, 2024 at 09:37:06AM +0200, Daniel Gustafsson wrote: > On 18 Jul 2024, at 09:29, Michael Paquier <michael@paquier.xyz> wrote: >> How about using a new error code in class 58, say a >> ERRCODE_FILE_NAME_TOO_LONG like in the attached? >> ERRCODE_DUPLICATE_FILE is like that; it exists just for the mapping >> with EEXIST. > > Agreed, that's probably a better option. Still sounds like a better idea to me after a night of sleep. Would somebody disagree about this idea? HEAD only, of course. -- Michael
Attachment
On Thu, Jul 18, 2024 at 09:37:06AM +0200, Daniel Gustafsson wrote: > On 18 Jul 2024, at 09:29, Michael Paquier <michael@paquier.xyz> wrote: >> How about using a new error code in class 58, say a >> ERRCODE_FILE_NAME_TOO_LONG like in the attached? >> ERRCODE_DUPLICATE_FILE is like that; it exists just for the mapping >> with EEXIST. > > Agreed, that's probably a better option. Applied this one now on HEAD. On second look, all buildfarm environments seem to be OK with this errno, as far as I've checked, so that should be OK. -- Michael