Thread: Internal error codes triggered by tests

Internal error codes triggered by tests

From
Michael Paquier
Date:
Hi all,

While analyzing the use of internal error codes in the code base, I've
some problems, and that's a mixed bag of:
- Incorrect uses, for errors that can be triggered by users with
vallid cases.
- Expected error cases, wanted by the tests like corruption cases or
  just to keep some code simpler.

Here is a summary of the ones that should be fixed with proper
errcodes:
1) be-secure-openssl.c is forgetting an error codes for code comparing
the ssl_{min,max}_protocol_version range, which should be a
ERRCODE_CONFIG_FILE_ERROR.
2) 010_pg_basebackup.pl includes a test for an unmatching system ID at
its bottom, that triggers an internal error as an effect of
manifest_report_error().
3) basebackup.c, with a too long symlink or tar name, where
ERRCODE_PROGRAM_LIMIT_EXCEEDED should make sense.
4) pg_walinspect, for an invalid LSN.  That would be
ERRCODE_INVALID_PARAMETER_VALUE.
5) Some paths of pg_ucol_open() are failing internally, still these
refer to parameters that can be set, so I've attached
ERRCODE_INVALID_PARAMETER_VALUE to them.
6) PerformWalRecovery(), where recovery ends before target is reached,
for a ERRCODE_CONFIG_FILE_ERROR.
7) pg_replication_slot_advance(), missing for the target LSN a
ERRCODE_INVALID_PARAMETER_VALUE.
8) Isolation test alter-table-4/s2, able to trigger a "attribute "a"
of relation "c1" does not match parent's type".  Shouldn't that be a
ERRCODE_INVALID_COLUMN_DEFINITION?

Then there is a series of issues triggered by the main regression test
suite, applying three times (pg_upgrade, make check and
027_stream_regress.pl):
1) MergeConstraintsIntoExisting() under a case of relhassubclass, for
ERRCODE_INVALID_OBJECT_DEFINITION.
2) Trigger rename on a partition, see renametrig(), for
ERRCODE_FEATURE_NOT_SUPPORTED.
3) "could not find suitable unique index on materialized view", with a
plain elog() in matview.c, for ERRCODE_FEATURE_NOT_SUPPORTED
4) "role \"blah\" cannot have explicit members", for
ERRCODE_FEATURE_NOT_SUPPORTED.
5) Similar to previous one, AddRoleMems() with "role \"blah\" cannot
be a member of any role"
6) icu_validate_locale(), icu_language_tag() and make_icu_collator()
for invalid parameter inputs.
7) ATExecAlterConstraint()

There are a few fancy cases where we expect an internal error per the
state of the tests:
1) amcheck
1-1) bt_index_check_internal() in amcheck, where the code has the idea
to open an OID given in input, trigerring an elog().  That does not
strike me as a good idea, though that's perhaps acceptable.  The error
is an "could not open relation with OID".
1-2) 003_check.pl has 12 cases with backtraces expected in the outcome
as these check corruption cases.
2) pg_visibility does the same thing, for two tests trigerring a
"could not open relation with OID".
3) plpython with 6 cases which are internal, not sure what to do about
these.
4) test_resowner has two cases triggered by SQL functions, which are
expected to be internal.
5) test_tidstore, with "tuple offset out of range" triggered by a SQL
call.
6) injection_points, that are aimed for tests, has six backtraces.
7) currtid_internal()..  Perhaps we should try to rip out this stuff,
which is specific to ODBC.  There are a lot of backtraces here.
8) satisfies_hash_partition() in test hash_part, generating a
backtrace for an InvalidOid in the main regression test suite.

All these cases are able to trigger backtraces, and while of them are
OK to keep as they are, the cases of the first and second lists ought
to be fixed, and attached is a patch do close the gap.  This reduces
the number of internal errors generated by the tests from 85 to 35,
with injection points enabled.

Note that I've kept the currtid() errcodes in it, though I don't think
much of them.  The rest looks sensible enough to address.

Thoughts?
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Justin Pryzby
Date:
I sent a list of user-facing elogs here, a few times.
ZDclRM/jaTe66nce@telsasoft.com

And if someone had expressed an interest, I might have sent a longer
list.



Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Mon, Apr 29, 2024 at 08:02:45AM -0500, Justin Pryzby wrote:
> I sent a list of user-facing elogs here, a few times.
> ZDclRM/jaTe66nce@telsasoft.com
>
> And if someone had expressed an interest, I might have sent a longer
> list.

Thanks.  I'll take a look at what you have there.  Nothing would be
committed before v18 opens, though.
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Robert Haas
Date:
On Tue, Apr 23, 2024 at 12:55 AM Michael Paquier <michael@paquier.xyz> wrote:
> Thoughts?

The patch as proposed seems fine. Marking RfC.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Fri, May 17, 2024 at 01:41:29PM -0400, Robert Haas wrote:
> On Tue, Apr 23, 2024 at 12:55 AM Michael Paquier <michael@paquier.xyz> wrote:
>> Thoughts?
>
> The patch as proposed seems fine. Marking RfC.

Thanks.  I'll look again at that once v18 opens up for business.
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Sat, May 18, 2024 at 10:56:43AM +0900, Michael Paquier wrote:
> Thanks.  I'll look again at that once v18 opens up for business.

Looked at that again, and one in tablecmds.c is not needed anymore,
and there was a conflict in be-secure-openssl.c.  Removed the first
one, fixed the second one, then applied the patch after a second look.
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Alexander Lakhin
Date:
Hello Michael,

04.07.2024 03:51, Michael Paquier wrote:
> On Sat, May 18, 2024 at 10:56:43AM +0900, Michael Paquier wrote:
>> Thanks.  I'll look again at that once v18 opens up for business.
> Looked at that again, and one in tablecmds.c is not needed anymore,
> and there was a conflict in be-secure-openssl.c.  Removed the first
> one, fixed the second one, then applied the patch after a second look.

Could you please share your thoughts regarding other error cases, which is
not triggered by existing tests, but still can be easily reached by users?

For example:
SELECT satisfies_hash_partition(1, 1, 0, 0);

ERROR:  XX000: could not open relation with OID 1
LOCATION:  relation_open, relation.c:61

or:
CREATE TABLE t (b bytea);
INSERT INTO t SELECT ''::bytea;
CREATE INDEX brinidx ON t USING brin
  (b bytea_bloom_ops(n_distinct_per_range = -1.0));

ERROR:  XX000: the bloom filter is too large (44629 > 8144)
LOCATION:  bloom_init, brin_bloom.c:344

Should such cases be corrected too?

Best regards,
Alexander



Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Thu, Jul 04, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:
> Could you please share your thoughts regarding other error cases, which is
> not triggered by existing tests, but still can be easily reached by users?
>
> For example:
> SELECT satisfies_hash_partition(1, 1, 0, 0);
>
> ERROR:  XX000: could not open relation with OID 1
> LOCATION:  relation_open, relation.c:61
>
> or:
> CREATE TABLE t (b bytea);
> INSERT INTO t SELECT ''::bytea;
> CREATE INDEX brinidx ON t USING brin
>  (b bytea_bloom_ops(n_distinct_per_range = -1.0));
>
> ERROR:  XX000: the bloom filter is too large (44629 > 8144)
> LOCATION:  bloom_init, brin_bloom.c:344
>
> Should such cases be corrected too?

This is a case-by-case.  satisfies_hash_partition() is undocumented,
so doing nothing is fine by me.  The second one, though is something
taht can be triggered with rather normal DDL sequences.  That's more
annoying.
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Alexander Lakhin
Date:
Hello Michael,

05.07.2024 03:57, Michael Paquier wrote:
> On Thu, Jul 04, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:
>> Could you please share your thoughts regarding other error cases, which is
>> not triggered by existing tests, but still can be easily reached by users?
>>
>> For example:
>> SELECT satisfies_hash_partition(1, 1, 0, 0);
>>
>> ERROR:  XX000: could not open relation with OID 1
>> LOCATION:  relation_open, relation.c:61
>>
>> or:
>> CREATE TABLE t (b bytea);
>> INSERT INTO t SELECT ''::bytea;
>> CREATE INDEX brinidx ON t USING brin
>>   (b bytea_bloom_ops(n_distinct_per_range = -1.0));
>>
>> ERROR:  XX000: the bloom filter is too large (44629 > 8144)
>> LOCATION:  bloom_init, brin_bloom.c:344
>>
>> Should such cases be corrected too?
> This is a case-by-case.  satisfies_hash_partition() is undocumented,
> so doing nothing is fine by me.  The second one, though is something
> taht can be triggered with rather normal DDL sequences.  That's more
> annoying.

Thank you for the answer!

Let me show you other error types for discussion/classification:
SELECT pg_describe_object(1::regclass, 0, 0);

ERROR:  XX000: unsupported object class: 1
LOCATION:  getObjectDescription, objectaddress.c:4016
or
SELECT pg_identify_object_as_address('1'::regclass, 1, 1);

ERROR:  XX000: unsupported object class: 1
LOCATION:  getObjectTypeDescription, objectaddress.c:4597

--
SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000))
\gexec
ERROR:  XX000: could not open file "pg_snapshots/-----...---" for reading: File name too long
LOCATION:  ImportSnapshot, snapmgr.c:1428

--
CREATE OPERATOR === (leftarg = int4, rightarg = int4, procedure = int4eq,
   commutator = ===, hashes);

CREATE TABLE t1 (a int);
ANALYZE t1;
CREATE TABLE t2 (a int);

SELECT * FROM t1, t2 WHERE t1.a === t2.a;

ERROR:  XX000: could not find hash function for hash operator 16385
LOCATION:  ExecHashTableCreate, nodeHash.c:560

--
WITH RECURSIVE oq(x) AS (
     WITH iq as (SELECT * FROM oq)
     SELECT * FROM iq
     UNION
     SELECT * from iq
)
SELECT * FROM oq;

ERROR:  XX000: missing recursive reference
LOCATION:  checkWellFormedRecursion, parse_cte.c:896
(commented as "should not happen", but still...)

--
CREATE EXTENSION ltree;
SELECT '1' ::ltree @ (repeat('!', 100)) ::ltxtquery;

ERROR:  XX000: stack too short
LOCATION:  makepol, ltxtquery_io.c:252

--
There is also a couple of dubious ereport(DEBUG1,
(errcode(ERRCODE_INTERNAL_ERROR), ...) calls like:
         /*
          * User-defined picksplit failed to create an actual split, ie it put
          * everything on the same side.  Complain but cope.
          */
         ereport(DEBUG1,
                 (errcode(ERRCODE_INTERNAL_ERROR),
                  errmsg("picksplit method for column %d of index \"%s\" failed",
                         attno + 1, RelationGetRelationName(r)),

I'm not mentioning errors, that require more analysis and maybe correcting
the surrounding logic, not ERRCODE only.

Maybe it makes sense to separate the discussion of such errors, which are
not triggered by tests/not covered; I'm just not sure how to handle them
efficiently.

Best regards,
Alexander



Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Mon, Jul 08, 2024 at 12:00:00PM +0300, Alexander Lakhin wrote:
> Let me show you other error types for discussion/classification:
> SELECT pg_describe_object(1::regclass, 0, 0);
>
> ERROR:  XX000: unsupported object class: 1
> LOCATION:  getObjectDescription, objectaddress.c:4016
> or
> SELECT pg_identify_object_as_address('1'::regclass, 1, 1);
>
> ERROR:  XX000: unsupported object class: 1
> LOCATION:  getObjectTypeDescription, objectaddress.c:4597

These ones are old enough, indeed.  Knowing that they usually come
down to be used with scans of pg_shdepend and pg_depend to get some
information about the objects involved, I've never come down to see if
these were really worth tackling.  The cases of dropped/undefined
case is much more interesting, because we may give in input object IDs
that have been dropped concurrently.  Class IDs are constants fixed in
time.

> --
> SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
> SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000))
> \gexec
> ERROR:  XX000: could not open file "pg_snapshots/-----...---" for reading: File name too long
> LOCATION:  ImportSnapshot, snapmgr.c:1428

This one is fun.  errcode_for_file_access() does not know about
ENAMETOOLONG, as an effect of the errno returned by AllocateFile().
Perhaps it should map to ERRCODE_NAME_TOO_LONG?

> CREATE OPERATOR === (leftarg = int4, rightarg = int4, procedure = int4eq,
>   commutator = ===, hashes);
>
> CREATE TABLE t1 (a int);
> ANALYZE t1;
> CREATE TABLE t2 (a int);
>
> SELECT * FROM t1, t2 WHERE t1.a === t2.a;
>
> ERROR:  XX000: could not find hash function for hash operator 16385
> LOCATION:  ExecHashTableCreate, nodeHash.c:560

Hehe.  You are telling that this operator supports a hash join, but
nope.  I am not really convinced that this is worth worrying.


> --
> WITH RECURSIVE oq(x) AS (
>     WITH iq as (SELECT * FROM oq)
>     SELECT * FROM iq
>     UNION
>     SELECT * from iq
> )
> SELECT * FROM oq;
>
> ERROR:  XX000: missing recursive reference
> LOCATION:  checkWellFormedRecursion, parse_cte.c:896
> (commented as "should not happen", but still...)

Hmm.  This one feels like a bug, indeed.

> --
> CREATE EXTENSION ltree;
> SELECT '1' ::ltree @ (repeat('!', 100)) ::ltxtquery;
>
> ERROR:  XX000: stack too short
> LOCATION:  makepol, ltxtquery_io.c:252

ltree has little maintenance, not sure that's worth worrying.

> I'm not mentioning errors, that require more analysis and maybe correcting
> the surrounding logic, not ERRCODE only.
>
> Maybe it makes sense to separate the discussion of such errors, which are
> not triggered by tests/not covered; I'm just not sure how to handle them
> efficiently.

The scope is too broad for some of them like the CTE case, so separate
threads to attract the correct review audience makes sense to me.
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Daniel Gustafsson
Date:
> On 10 Jul 2024, at 06:42, Michael Paquier <michael@paquier.xyz> wrote:

>> SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
>> SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000))
>> \gexec
>> ERROR:  XX000: could not open file "pg_snapshots/-----...---" for reading: File name too long
>> LOCATION:  ImportSnapshot, snapmgr.c:1428
>
> This one is fun.  errcode_for_file_access() does not know about
> ENAMETOOLONG, as an effect of the errno returned by AllocateFile().
> Perhaps it should map to ERRCODE_NAME_TOO_LONG?

Mapping this case to ERRCODE_NAME_TOO_LONG seems like a legit improvement, even
though the error is likely to be quite rare in production.

The rest mentioned upthread seems either not worth the effort or are likely to
be bugs warranting proper investigation.

--
Daniel Gustafsson




Re: Internal error codes triggered by tests

From
Alexander Lakhin
Date:
Hello Daniel and Michael,

12.07.2024 23:41, Daniel Gustafsson wrote:
>> On 10 Jul 2024, at 06:42, Michael Paquier <michael@paquier.xyz> wrote:
> The rest mentioned upthread seems either not worth the effort or are likely to
> be bugs warranting proper investigation.
>

I've filed a bug report about the "WITH RECURSIVE" anomaly: [1], but what
I wanted to understand when presenting different error kinds is what
definition XX000 errors could have in principle?

It seems to me that we can't define them as indicators of unexpected (from
the server's POV) conditions, similar to assertion failures (but produced
with no asserts enabled), which should be and mostly get fixed.

If the next thing to do is to get backtrace_on_internal_error back and
that GUC is mainly intended for developers, then maybe having clean (or
containing expected backtraces only) regression test logs is a final goal
and we should stop here. But if it's expected that that GUC could be
helpful for users to analyze such errors in production and thus pay extra
attention to them, maybe having XX000 status for presumably
unreachable conditions only is desirable...

[1] https://www.postgresql.org/message-id/18536-0a342ec07901203e%40postgresql.org

Best regards,
Alexander



Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Sun, Jul 14, 2024 at 07:00:00PM +0300, Alexander Lakhin wrote:
> I've filed a bug report about the "WITH RECURSIVE" anomaly: [1], but what
> I wanted to understand when presenting different error kinds is what
> definition XX000 errors could have in principle?

Cool, thanks!  I can see that Tom has already committed a fix.

I'm going to start a new thread for ERRCODE_NAME_TOO_LONG.  It would
be confusing to solve the issue in the middle of this thread.

> If the next thing to do is to get backtrace_on_internal_error back and
> that GUC is mainly intended for developers, then maybe having clean (or
> containing expected backtraces only) regression test logs is a final goal
> and we should stop here. But if it's expected that that GUC could be
> helpful for users to analyze such errors in production and thus pay extra
> attention to them, maybe having XX000 status for presumably
> unreachable conditions only is desirable...

Perhaps.  Let's see where it leads if we have this discussion again.
Some internal errors cannot be avoided because some tests expect such
cases (failures with on-disk file manipulation is one).
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Fri, Jul 12, 2024 at 10:41:14PM +0200, Daniel Gustafsson wrote:
> Mapping this case to ERRCODE_NAME_TOO_LONG seems like a legit improvement, even
> though the error is likely to be quite rare in production.

Hmm.  This is interesting, still it could be confusing as
ERRCODE_NAME_TOO_LONG is used only for names, when they are longer
than NAMEDATALEN, so in context that's a bit different than a longer
file name.

How about using a new error code in class 58, say a
ERRCODE_FILE_NAME_TOO_LONG like in the attached?
ERRCODE_DUPLICATE_FILE is like that; it exists just for the mapping
with EEXIST.
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Daniel Gustafsson
Date:
> On 18 Jul 2024, at 09:29, Michael Paquier <michael@paquier.xyz> wrote:

> How about using a new error code in class 58, say a
> ERRCODE_FILE_NAME_TOO_LONG like in the attached?
> ERRCODE_DUPLICATE_FILE is like that; it exists just for the mapping
> with EEXIST.

Agreed, that's probably a better option.

--
Daniel Gustafsson




Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Thu, Jul 18, 2024 at 09:37:06AM +0200, Daniel Gustafsson wrote:
> On 18 Jul 2024, at 09:29, Michael Paquier <michael@paquier.xyz> wrote:
>> How about using a new error code in class 58, say a
>> ERRCODE_FILE_NAME_TOO_LONG like in the attached?
>> ERRCODE_DUPLICATE_FILE is like that; it exists just for the mapping
>> with EEXIST.
>
> Agreed, that's probably a better option.

Still sounds like a better idea to me after a night of sleep.  Would
somebody disagree about this idea?  HEAD only, of course.
--
Michael

Attachment

Re: Internal error codes triggered by tests

From
Michael Paquier
Date:
On Thu, Jul 18, 2024 at 09:37:06AM +0200, Daniel Gustafsson wrote:
> On 18 Jul 2024, at 09:29, Michael Paquier <michael@paquier.xyz> wrote:
>> How about using a new error code in class 58, say a
>> ERRCODE_FILE_NAME_TOO_LONG like in the attached?
>> ERRCODE_DUPLICATE_FILE is like that; it exists just for the mapping
>> with EEXIST.
>
> Agreed, that's probably a better option.

Applied this one now on HEAD.  On second look, all buildfarm
environments seem to be OK with this errno, as far as I've checked, so
that should be OK.
--
Michael

Attachment