Thread: [sqlsmith] Failed assertion in joinrels.c
Hi,

sqlsmith triggered the following assertion in master (c188204).

TRAP: FailedAssertion("!(!bms_overlap(joinrelids, sjinfo->min_lefthand))", File: "joinrels.c", Line: 500)

As usual, the query is against the regression database. It is rather unwieldy… I wonder if I should stop working on new grammar rules and instead work on some post-processing that prunes the AST as much as possible while maintaining the failure mode.

regards,
andreas

select subq_647409.c0 as c0, subq_647409.c0 as c1 from public.customer as rel_4116461 left join public.clstr_tst_s as rel_4116555 left join information_schema.columns as rel_4116556 on (rel_4116555.rf_a = rel_4116556.ordinal_position) right join pg_catalog.pg_roles as rel_4116557 on (rel_4116556.maximum_cardinality = rel_4116557.rolconnlimit) on (rel_4116461.passwd = rel_4116557.rolpassword) left join (select subq_647410.c8 as c0 from public.char_tbl as rel_4116611, lateral (select rel_4116612.name as c0, rel_4116612.comment as c1, rel_4116612.nslots as c2, rel_4116612.comment as c3, rel_4116612.nslots as c4, rel_4116612.nslots as c5, rel_4116612.comment as c6, rel_4116612.comment as c7, rel_4116612.nslots as c8, rel_4116612.nslots as c9, rel_4116612.nslots as c10 from public.hub as rel_4116612 where rel_4116612.comment ~>=~ rel_4116612.comment fetch first 116 rows only) as subq_647410 where (subq_647410.c7 !~~* subq_647410.c7) or ((subq_647410.c3 = subq_647410.c3) and ((subq_647410.c7 ~* subq_647410.c6) and (subq_647410.c7 @@ subq_647410.c7))) fetch first 152 rows only) as subq_647409 inner join public.int4_tbl as rel_4116661 inner join public.shoe as rel_4116662 on (rel_4116661.f1 = rel_4116662.sh_avail) inner join public.rtest_vview3 as rel_4116663 on (rel_4116661.f1 = rel_4116663.a) on (subq_647409.c0 = rel_4116662.sh_avail) on (rel_4116555.b = rel_4116661.f1) where ((rel_4116557.rolvaliduntil is NULL) or (rel_4116663.b !~ rel_4116461.name)) or (rel_4116661.f1 is not NULL) fetch first 80 rows only;
Andreas Seltenreich <seltenreich@gmx.de> writes:
> sqlsmith triggered the following assertion in master (c188204).
> TRAP: FailedAssertion("!(!bms_overlap(joinrelids, sjinfo->min_lefthand))", File: "joinrels.c", Line: 500)

Cool, I'll take a look.

> As usual, the query is against the regression database. It is rather
> unwieldy… I wonder if I should stop working on new grammar rules and
> instead work on some post-processing that prunes the AST as much as
> possible while maintaining the failure mode.

Probably not really worth the trouble; I find it's usually easy to produce a minimized test case after the failure cause is understood.

What concerns me more is that what you're finding is only cases that trip an assertion sanity check. It seems likely that you're also managing to trigger other bugs with less drastic consequences, such as "could not devise a query plan" failures or just plain wrong answers. I'm not sure how we could identify wrong answers automatically :-( but it might be worth checking for XX000 SQLSTATE responses, since generally that should be a can't-happen case. (Or if it can happen, we need to change the errcode.)

regards, tom lane
Tom Lane writes:
> What concerns me more is that what you're finding is only cases that trip
> an assertion sanity check. It seems likely that you're also managing to
> trigger other bugs with less drastic consequences, such as "could not
> devise a query plan" failures or just plain wrong answers.

Ja, some of these are logged as well[1], but most of them are really as undrastic as can get, and I was afraid reporting them would be more of a nuisance. I analysed a couple of the cache lookup failures, and they all had a severity similar to the example in the README[2]. The operator ones I analysed seem due to intentionally broken operators in the regression db. The NestLoopParams and subplan reference ones sound interesting though…

> I'm not sure how we could identify wrong answers automatically :-(

Csmith isn't doing this either. They do discuss differential testing in their papers though, i.e., comparing the results of different products. Maybe a simple metric like the number of rows returned might already be valuable for correctness checks. I also thought about doing some sampling on the data, simulating the relational operations, and checking for witness tuples, but it is probably not appropriate to start implementing a mini-RDBMS on the client side.

> but it might be worth checking for XX000 SQLSTATE responses, since
> generally that should be a can't-happen case. (Or if it can happen,
> we need to change the errcode.)

The sqlstate is currently missing in the reports because libpqxx is not putting it in its exceptions :-/.

regards,
Andreas

Footnotes:
[1] smith=# select * from report24h;
 count | error
-------+--------------------------------------------------------------------------
 43831 | ERROR: unsupported XML feature
 39496 | ERROR: invalid regular expression: quantifier operand invalid
 27261 | ERROR: canceling statement due to statement timeout
 21386 | ERROR: operator does not exist: point = point
  8580 | ERROR: cannot compare arrays of different element types
  5019 | ERROR: invalid regular expression: brackets [] not balanced
  4646 | ERROR: could not determine which collation to use for string comparison
  2583 | ERROR: invalid regular expression: nfa has too many states
  2248 | ERROR: operator does not exist: xml = xml
  1198 | ERROR: operator does not exist: polygon = polygon
  1171 | ERROR: cache lookup failed for index 16862
   677 | ERROR: invalid regular expression: parentheses () not balanced
   172 | ERROR: cache lookup failed for index 257148
    84 | ERROR: could not find member 1(34520,34520) of opfamily 1976
    55 | ERROR: missing support function 1(34516,34516) in opfamily 1976
    42 | ERROR: operator does not exist: city_budget = city_budget
    13 | ERROR: could not find commutator for operator 34538
    10 | ERROR: could not identify a comparison function for type xid
     4 | Connection to database failed
     4 | ERROR: cache lookup failed for index 2619
     3 | ERROR: plan should not reference subplan's variable
     2 | ERROR: cache lookup failed for index 12322
     2 | ERROR: failed to assign all NestLoopParams to plan nodes
     2 | ERROR: invalid regular expression: invalid character range
     1 | ERROR: could not find pathkey item to sort
(25 rows)

Time: 1158,990 ms

[2] https://github.com/anse1/sqlsmith/blob/master/README.org
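For reference, the SQLSTATE of a failed query is available from plain libpq even when the client library on top of it doesn't pass it along: PQresultErrorField() with PG_DIAG_SQLSTATE returns the five-character code, so a driver loop can flag XX000 responses directly. A minimal sketch, assuming a bare libpq client; the connection string and query below are placeholders, not anything from sqlsmith:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
	/* Placeholder connection string; the real harness will differ. */
	PGconn	   *conn = PQconnectdb("dbname=regression");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* Placeholder query standing in for a generated statement. */
	res = PQexec(conn, "select 1");

	if (PQresultStatus(res) == PGRES_FATAL_ERROR)
	{
		const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

		/* XX000 is ERRCODE_INTERNAL_ERROR, i.e. "shouldn't happen" */
		if (sqlstate != NULL && strcmp(sqlstate, "XX000") == 0)
			fprintf(stderr, "internal error: %s", PQresultErrorMessage(res));
	}

	PQclear(res);
	PQfinish(conn);
	return 0;
}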
Andreas Seltenreich <seltenreich@gmx.de> writes:
> Tom Lane writes:
>> What concerns me more is that what you're finding is only cases that trip
>> an assertion sanity check. It seems likely that you're also managing to
>> trigger other bugs with less drastic consequences, such as "could not
>> devise a query plan" failures or just plain wrong answers.

> Ja, some of these are logged as well[1], but most of them are really as
> undrastic as can get, and I was afraid reporting them would be more of a
> nuisance.

Well, I certainly think all of these represent bugs:

> 3 | ERROR: plan should not reference subplan's variable
> 2 | ERROR: failed to assign all NestLoopParams to plan nodes
> 1 | ERROR: could not find pathkey item to sort

This I'm not sure about; it could be that the query gave conflicting collation specifiers, but on the other hand we've definitely had bugs with people forgetting to run assign_query_collations on subexpressions:

> 4646 | ERROR: could not determine which collation to use for string comparison

This one's pretty darn odd, because 2619 is pg_statistic and not an index at all:

> 4 | ERROR: cache lookup failed for index 2619

These seem likely to be bugs as well, though maybe they are race conditions during a DROP and not worth fixing:

> 1171 | ERROR: cache lookup failed for index 16862
> 172 | ERROR: cache lookup failed for index 257148
> 84 | ERROR: could not find member 1(34520,34520) of opfamily 1976
> 55 | ERROR: missing support function 1(34516,34516) in opfamily 1976
> 13 | ERROR: could not find commutator for operator 34538
> 2 | ERROR: cache lookup failed for index 12322

I would say anything of the sort that is repeatable definitely deserves investigation, because even if it's an expectable error condition, we should be throwing a more user-friendly error message.

regards, tom lane
Tom Lane writes:
> Well, I certainly think all of these represent bugs:
>
> [...]

thanks for prioritizing them. I'll try to digest them somewhat before posting.

> This one's pretty darn odd, because 2619 is pg_statistic and not an index
> at all:
>
>> 4 | ERROR: cache lookup failed for index 2619

This is actually the one from the README :-). Quoting to spare media discontinuity:

--8<---------------cut here---------------start------------->8---
Taking a closer look at it reveals that it happens when you query a certain catalog view like this:

self=# select indexdef from pg_catalog.pg_indexes where indexdef is not NULL;
ERROR:  cache lookup failed for index 2619

This is because the planner then puts pg_get_indexdef(oid) in a context where it sees non-index OIDs, which causes it to croak:

                                      QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join  (cost=17.60..30.65 rows=9 width=4)
   Hash Cond: (i.oid = x.indexrelid)
   ->  Seq Scan on pg_class i  (cost=0.00..12.52 rows=114 width=8)
         Filter: ((pg_get_indexdef(oid) IS NOT NULL) AND (relkind = 'i'::"char"))
   ->  Hash  (cost=17.31..17.31 rows=23 width=4)
         ->  Hash Join  (cost=12.52..17.31 rows=23 width=4)
               Hash Cond: (x.indrelid = c.oid)
               ->  Seq Scan on pg_index x  (cost=0.00..4.13 rows=113 width=8)
               ->  Hash  (cost=11.76..11.76 rows=61 width=8)
                     ->  Seq Scan on pg_class c  (cost=0.00..11.76 rows=61 width=8)
                           Filter: (relkind = ANY ('{r,m}'::"char"[]))
--8<---------------cut here---------------end--------------->8---

thanks,
Andreas
On 08/01/2015 05:59 PM, Tom Lane wrote:
> Well, I certainly think all of these represent bugs:
>
>> 1 | ERROR: could not find pathkey item to sort

sqlsmith was able to find these two queries that trigger the error on my machine:

ERROR: could not find pathkey item to sort
STATEMENT: select rel_16564180.id1 as c0 from public.bprime as rel_16564178 left join (select rel_16564179.b as c0, rel_16564179.c as c1 from public.clstr_tst as rel_16564179 where rel_16564179.c < rel_16564179.d) as subq_2610919 inner join public.num_result as rel_16564180 on (subq_2610919.c0 = rel_16564180.id1) on (rel_16564178.thousand = subq_2610919.c0) where subq_2610919.c1 = subq_2610919.c1 fetch first 122 rows only;

STATEMENT: select subq_1991714.c0 as c0, subq_1991712.c0 as c1, rel_12624817.a as c2 from (select rel_12624653.id2 as c0, rel_12624653.id2 as c1 from public.num_exp_add as rel_12624653 where rel_12624653.expected >= rel_12624653.expected fetch first 137 rows only) as subq_1991697 inner join (select rel_12624805.z as c0 from public.insert_tbl as rel_12624805 where rel_12624805.y <= rel_12624805.y fetch first 108 rows only) as subq_1991712 left join public.clstr_tst as rel_12624817 left join public.main_table as rel_12624818 on (rel_12624817.b = rel_12624818.a) on (subq_1991712.c0 = rel_12624818.a) on (subq_1991697.c0 = rel_12624818.a), lateral (select rel_12624819.tgdeferrable as c0 from pg_catalog.pg_trigger as rel_12624819 where (rel_12624819.tgconstraint >= rel_12624819.tgrelid) and ((rel_12624819.tgconstrindid <= rel_12624819.tgrelid) or (EXISTS (select rel_12624820.grantor as c0, rel_12624820.grantee as c1, rel_12624820.is_grantable as c2, rel_12624820.routine_schema as c3, rel_12624820.specific_schema as c4, rel_12624820.grantee as c5, rel_12624820.routine_catalog as c6, rel_12624820.routine_schema as c7, rel_12624820.specific_name as c8 from information_schema.role_routine_grants as rel_12624820 where 22 >= 29 fetch first 138 rows only))) fetch first 93 rows only) as subq_1991714 where rel_12624817.d = rel_12624817.c fetch first 97 rows only;
On Fri, Jul 31, 2015 at 5:56 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> sqlsmith triggered the following assertion in master (c188204).

Thanks for writing sqlsmith. It seems like a great tool.

I wonder, are you just running the tool with assertions enabled when PostgreSQL is built? If so, it might make sense to make various problems more readily detected. As you may know, Clang has a pretty decent option called AddressSanitizer that can detect memory errors as they occur with an overhead that is not excessive. One might use the following configure arguments when building PostgreSQL to use AddressSanitizer:

./configure CC=clang CFLAGS='-O1 -g -fsanitize=address -fno-omit-frame-pointer -fno-optimize-sibling-calls' --enable-cassert

Of course, it remains to be seen if this pays for itself. Apparently the tool has about a 2x overhead [1]. I'm really not sure that you'll find any more bugs this way, but it's certainly possible that you'll find a lot more. Given your success in finding bugs without using AddressSanitizer, introducing it may be premature.

[1] http://clang.llvm.org/docs/AddressSanitizer.html#introduction
--
Peter Geoghegan
[sqlsmith] subplan variable reference / unassigned NestLoopParams (was: [sqlsmith] Failed assertion in joinrels.c)
Tom Lane writes:
> Well, I certainly think all of these represent bugs:
>
>> 3 | ERROR: plan should not reference subplan's variable
>> 2 | ERROR: failed to assign all NestLoopParams to plan nodes

These appear to be related. The following query produces the former, but if you replace the very last reference of provider with the literal 'bar', it raises the latter error.

select 1 from pg_catalog.pg_shseclabel as rel_09 inner join public.rtest_view2 as rel_32 left join pg_catalog.pg_roles as rel_33 on (rel_32.a = rel_33.rolconnlimit) on (rel_09.provider = rel_33.rolpassword) left join pg_catalog.pg_user as rel_35 on (rel_33.rolconfig = rel_35.useconfig) where (((rel_09.provider ~<~ 'foo') and (rel_35.usename ~* rel_09.provider)));

,----[ FWIW: git bisect run ]
| first bad commit: [e83bb10d6dcf05a666d4ada00d9788c7974ad378]
| Adjust definition of cheapest_total_path to work better with LATERAL.
`----

regards,
Andreas
Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
> On 08/01/2015 05:59 PM, Tom Lane wrote:
>> Well, I certainly think all of these represent bugs:
>>> 1 | ERROR: could not find pathkey item to sort

> sqlsmith was able to find these two queries that trigger the error on my
> machine:

Hmm ... I see no error with these queries as of today's HEAD or back-branch tips. I surmise that this was triggered by one of the other recently-fixed bugs, though the connection isn't obvious offhand.

regards, tom lane
On 08/02/2015 10:18 PM, Tom Lane wrote:
> Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
>> On 08/01/2015 05:59 PM, Tom Lane wrote:
>>> Well, I certainly think all of these represent bugs:
>>>> 1 | ERROR: could not find pathkey item to sort
>
>> sqlsmith was able to find these two queries that trigger the error
>> on my machine:
>
> Hmm ... I see no error with these queries as of today's HEAD or
> back-branch tips. I surmise that this was triggered by one of the
> other recently-fixed bugs, though the connection isn't obvious
> offhand.

Yeah, I'm no longer able to reproduce it either.
On 08/01/2015 05:59 PM, Tom Lane wrote:
> Well, I certainly think all of these represent bugs:

How about this one?

1 ERROR: could not find RelOptInfo for given relids

It's triggered on 13bba02271dce865cd20b6f49224889c73fed4e7 by this query and the attached one:

select 1 from public.nums as r5006875 inner join information_schema.domains as r5006876 on (r5006875.n = r5006876.character_maximum_length) inner join (select r5006878.z as c5 from testxmlschema.test2 as r5006878 where r5006878.q >= r5006878.q) as subq_771817 on (r5006875.n = subq_771817.c5) right join public.shoelace_arrive as r5006879 on (r5006875.n = r5006879.arr_quant) left join pg_catalog.pg_tablespace as r5006966 left join pg_catalog.pg_stat_user_functions as r5006967 on (r5006966.spcowner = r5006967.funcid) inner join pg_catalog.pg_authid as r5006968 on (r5006967.funcname = r5006968.rolname) on (subq_771817.c5 = r5006968.rolconnlimit) where r5006875.n is NULL;
Attachment
Peter Geoghegan writes:
> On Fri, Jul 31, 2015 at 5:56 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>> sqlsmith triggered the following assertion in master (c188204).
>
> Thanks for writing sqlsmith. It seems like a great tool.
>
> I wonder, are you just running the tool with assertions enabled when
> PostgreSQL is built?

Right. I have to admit my testing setup is still more tailored towards testing sqlsmith than postgres.

> If so, it might make sense to make various problems more readily
> detected. As you may know, Clang has a pretty decent option called
> AddressSanitizer that can detect memory errors as they occur with an
> overhead that is not excessive.

I didn't know this clang feature yet, thanks for pointing it out. I considered running some instances under valgrind to detect these, but the performance penalty seemed not worth it.

> One might use the following configure arguments when building
> PostgreSQL to use AddressSanitizer:
>
> ./configure CC=clang CFLAGS='-O1 -g -fsanitize=address
> -fno-omit-frame-pointer -fno-optimize-sibling-calls' --enable-cassert

A quick attempt to sneak these in made my ansible playbooks unhappy due to "make check" failures and other generated noise. I'll try to have an instance with the AddressSanitizer active soon though.

> Of course, it remains to be seen if this pays for itself. Apparently
> the tool has about a 2x overhead [1]. I'm really not sure that you'll
> find any more bugs this way, but it's certainly possible that you'll
> find a lot more. Given your success in finding bugs without using
> AddressSanitizer, introducing it may be premature.

Piotr also suggested on IRC to run coverage tests w/ sqlsmith. This could yield valuable hints in which direction to extend sqlsmith's grammar.

Thanks,
Andreas
Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
> How about this one?
> 1 ERROR: could not find RelOptInfo for given relids

That would be a bug, for sure ...

> It's triggered on 13bba02271dce865cd20b6f49224889c73fed4e7 by this query
> and the attached one:

... but I can't reproduce it on HEAD with either of these queries. Not clear why you're getting different results.

regards, tom lane
On 08/03/2015 09:18 PM, Tom Lane wrote:
> ... but I can't reproduce it on HEAD with either of these queries.
> Not clear why you're getting different results.

I'm terribly sorry, but I didn't notice that postgresql.conf was modified... Set join_collapse_limit = 32 and you should see the error.
Re: [sqlsmith] subplan variable reference / unassigned NestLoopParams (was: [sqlsmith] Failed assertion in joinrels.c)
Andreas Seltenreich <seltenreich@gmx.de> writes:
> Tom Lane writes:
>> Well, I certainly think all of these represent bugs:
>>> 3 | ERROR: plan should not reference subplan's variable
>>> 2 | ERROR: failed to assign all NestLoopParams to plan nodes

> These appear to be related. The following query produces the former,
> but if you replace the very last reference of provider with the literal
> 'bar', it raises the latter error.

Fixed that, thanks for the test case!

> ,----[ FWIW: git bisect run ]
> | first bad commit: [e83bb10d6dcf05a666d4ada00d9788c7974ad378]
> | Adjust definition of cheapest_total_path to work better with LATERAL.
> `----

There's still something fishy about your git bisect results; they don't have much to do with what seems to me to be the triggering condition. I suspect the problem is that git bisect doesn't allow for the possibility that the symptom might appear and disappear over time, ie it might have been visible at some early stage of the LATERAL work but been fixed later, and then reintroduced by still-later optimization efforts.

regards, tom lane
leak:save_ps_display_args
leak:__GI___strdup
$ export LSAN_OPTIONS=suppressions=leak.supp
AST stats (avg): height = 7.29877 nodes = 37.8156
296 ERROR: canceling statement due to statement timeout
166 ERROR: invalid regular expression: quantifier operand invalid
26 ERROR: could not determine which collation to use for string comparison
23 ERROR: cannot compare arrays of different element types
12 ERROR: invalid regular expression: brackets [] not balanced
5 ERROR: cache lookup failed for index 2619
2 ERROR: invalid regular expression: parentheses () not balanced
error rate: 0.0104921
Suppressions used:
count bytes template
1 520 save_ps_display_args
1 10 __GI___strdup
-----------------------------------------------------
Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
> On 08/03/2015 09:18 PM, Tom Lane wrote:
>> ... but I can't reproduce it on HEAD with either of these queries.
>> Not clear why you're getting different results.

> I'm terribly sorry, but I didn't notice that postgresql.conf was modified...
> Set join_collapse_limit = 32 and you should see the error.

Ah ... now I get that error on the smaller query, but the larger one (that you put in an attachment) still doesn't show any problem. Do you have any other nondefault settings?

regards, tom lane
On 08/05/2015 02:24 AM, Tom Lane wrote:
> Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
>> On 08/03/2015 09:18 PM, Tom Lane wrote:
>>> ... but I can't reproduce it on HEAD with either of these queries.
>>> Not clear why you're getting different results.
>
>> I'm terribly sorry, but I didn't notice that postgresql.conf was modified...
>> Set join_collapse_limit = 32 and you should see the error.
>
> Ah ... now I get that error on the smaller query, but the larger one
> (that you put in an attachment) still doesn't show any problem.
> Do you have any other nondefault settings?

Sorry, that needs from_collapse_limit = 32 as well.
Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
> On 08/05/2015 02:24 AM, Tom Lane wrote:
>> Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
>>> Set join_collapse_limit = 32 and you should see the error.
>> Ah ... now I get that error on the smaller query, but the larger one
>> (that you put in an attachment) still doesn't show any problem.
>> Do you have any other nondefault settings?

> Sorry, that needs from_collapse_limit = 32 as well.

Yeah, I assumed as much, but it still doesn't happen for me. Possibly something platform-dependent in statistics, or something like that.

Anyway, I fixed the problem exposed by the smaller query; would you check that the larger one is okay for you now?

regards, tom lane
Tom Lane writes:
>> On 08/01/2015 05:59 PM, Tom Lane wrote:
>>> Well, I certainly think all of these represent bugs:
>>>> 1 | ERROR: could not find pathkey item to sort
>
> Hmm ... I see no error with these queries as of today's HEAD or
> back-branch tips. I surmise that this was triggered by one of the other
> recently-fixed bugs, though the connection isn't obvious offhand.

I still see this error in master as of b8cbe43, but the queries are indeed a pita to reproduce. The one below is the only one so far that is robust against running ANALYZE on the regression db, and also reproduces when I run it as an EXTRA_TEST with make check.

regards,
Andreas

select rel_217088662.a as c0, rel_217088554.a as c1, rel_217088662.b as c2, subq_34235266.c0 as c3, rel_217088660.id2 as c4, rel_217088660.id2 as c5 from public.clstr_tst as rel_217088554 inner join (select rel_217088628.a as c0 from public.rtest_vview3 as rel_217088628 where (rel_217088628.b !~ rel_217088628.b) and ((((rel_217088628.b ~~* rel_217088628.b) or (rel_217088628.b ~* rel_217088628.b)) or (rel_217088628.b <> rel_217088628.b)) or (rel_217088628.b = rel_217088628.b))) as subq_34235266 inner join public.num_exp_mul as rel_217088660 inner join public.onek2 as rel_217088661 on (rel_217088660.id1 = rel_217088661.unique1) on (subq_34235266.c0 = rel_217088660.id1) inner join public.main_table as rel_217088662 on (rel_217088661.unique2 = rel_217088662.a) on (rel_217088554.b = rel_217088660.id1) where rel_217088554.d = rel_217088554.c fetch first 94 rows only;
On 08/05/2015 08:43 PM, Tom Lane wrote:
> Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
>> On 08/05/2015 02:24 AM, Tom Lane wrote:
>>> Piotr Stefaniak <postgres@piotr-stefaniak.me> writes:
>>>> Set join_collapse_limit = 32 and you should see the error.
>
>>> Ah ... now I get that error on the smaller query, but the larger one
>>> (that you put in an attachment) still doesn't show any problem.
>>> Do you have any other nondefault settings?
>
>> Sorry, that needs from_collapse_limit = 32 as well.
>
> Yeah, I assumed as much, but it still doesn't happen for me. Possibly
> something platform-dependent in statistics, or something like that.
>
> Anyway, I fixed the problem exposed by the smaller query; would you
> check that the larger one is okay for you now?

I can't reproduce the error in either case, so that seems to be fixed now. Thanks!
Andreas Seltenreich <seltenreich@gmx.de> writes:
> Tom Lane writes:
>> On 08/01/2015 05:59 PM, Tom Lane wrote:
>>> Well, I certainly think all of these represent bugs:
>>> 1 | ERROR: could not find pathkey item to sort

>> Hmm ... I see no error with these queries as of today's HEAD or
>> back-branch tips. I surmise that this was triggered by one of the other
>> recently-fixed bugs, though the connection isn't obvious offhand.

> I still see this error in master as of b8cbe43, but the queries are
> indeed a pita to reproduce. The one below is the only one so far that
> is robust against running ANALYZE on the regression db, and also
> reproduces when I run it as an EXTRA_TEST with make check.

Got that one, thanks! But we've seen this error message before in all sorts of weird situations, so I wouldn't assume that all the cases you've come across had the same cause.

regards, tom lane
Re: [sqlsmith] subplan variable reference / unassigned NestLoopParams
Tom Lane writes:
> Andreas Seltenreich <seltenreich@gmx.de> writes:
>> Tom Lane writes:
>>> Well, I certainly think all of these represent bugs:
>>>> 3 | ERROR: plan should not reference subplan's variable
>>>> 2 | ERROR: failed to assign all NestLoopParams to plan nodes
>
>> These appear to be related. The following query produces the former,
>> but if you replace the very last reference of provider with the literal
>> 'bar', it raises the latter error.
>
> Fixed that, thanks for the test case!

I haven't seen the former since your commit, but there recently were some instances of the latter. The attached queries all throw the error against master at 8752bbb.

regards,
Andreas
Attachment
Andreas Seltenreich <seltenreich@gmx.de> writes:
> Tom Lane writes:
>> Andreas Seltenreich <seltenreich@gmx.de> writes:
>>> Tom Lane writes:
>>>> Well, I certainly think all of these represent bugs:
>>>> 3 | ERROR: plan should not reference subplan's variable
>>>> 2 | ERROR: failed to assign all NestLoopParams to plan nodes

>>> These appear to be related. The following query produces the former,
>>> but if you replace the very last reference of provider with the literal
>>> 'bar', it raises the latter error.

>> Fixed that, thanks for the test case!

> I haven't seen the former since your commit, but there recently were
> some instances of the latter. The attached queries all throw the error
> against master at 8752bbb.

Turns out my patch only addressed some cases of the underlying issue. I've gone back for a second swipe at it; thanks for the continued testing!

BTW, the commit hashes you've been mentioning don't seem to correspond to anything in the master PG repo ... are you working with modified sources?

regards, tom lane
On Fri, Aug 7, 2015 at 9:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andreas Seltenreich <seltenreich@gmx.de> writes:
>> Tom Lane writes:
>>> On 08/01/2015 05:59 PM, Tom Lane wrote:
>>>> Well, I certainly think all of these represent bugs:
>>>> 1 | ERROR: could not find pathkey item to sort
>
>>> Hmm ... I see no error with these queries as of today's HEAD or
>>> back-branch tips. I surmise that this was triggered by one of the other
>>> recently-fixed bugs, though the connection isn't obvious offhand.
>
>> I still see this error in master as of b8cbe43, but the queries are
>> indeed a pita to reproduce. The one below is the only one so far that
>> is robust against running ANALYZE on the regression db, and also
>> reproduces when I run it as an EXTRA_TEST with make check.
>
> Got that one, thanks! But we've seen this error message before in all
> sorts of weird situations, so I wouldn't assume that all the cases you've
> come across had the same cause.

(resurrecting an old thread)

This is still failing:

=# select indexdef from pg_catalog.pg_indexes where indexdef is not NULL;
ERROR: XX000: cache lookup failed for index 2619
LOCATION: pg_get_indexdef_worker, ruleutils.c:1054

Do we want to improve at least the error message?
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> This is still failing:
> =# select indexdef from pg_catalog.pg_indexes where indexdef is not NULL;
> ERROR: XX000: cache lookup failed for index 2619
> LOCATION: pg_get_indexdef_worker, ruleutils.c:1054
> Do we want to improve at least the error message?

Maybe we should just make the function silently return NULL if the OID doesn't refer to an index relation. There's precedent for that sort of behavior ...

regards, tom lane
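The sort of change being floated here amounts to probing the syscache explicitly in the SQL-callable function and returning NULL instead of letting the lookup fail with an internal error. A rough sketch of that approach, for illustration only (not necessarily how the eventually committed patch is structured inside ruleutils.c):

Datum
pg_get_indexdef(PG_FUNCTION_ARGS)
{
	Oid			indexrelid = PG_GETARG_OID(0);

	/*
	 * If the OID is unknown or not an index, just say NULL instead of
	 * erroring with "cache lookup failed for index %u".
	 */
	if (!SearchSysCacheExists1(INDEXRELID, ObjectIdGetDatum(indexrelid)))
		PG_RETURN_NULL();

	/* ... continue with the existing deparsing code ... */
}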
On Sun, Jun 5, 2016 at 12:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> This is still failing:
>> =# select indexdef from pg_catalog.pg_indexes where indexdef is not NULL;
>> ERROR: XX000: cache lookup failed for index 2619
>> LOCATION: pg_get_indexdef_worker, ruleutils.c:1054
>> Do we want to improve at least the error message?
>
> Maybe we should just make the function silently return NULL if the OID
> doesn't refer to an index relation. There's precedent for that sort
> of behavior ...

For views it simply returns "Not a view", and for rules that's "-". All the others behave similarly to indexes:

=# select pg_get_constraintdef(0);
ERROR: XX000: cache lookup failed for constraint 0
=# select pg_get_triggerdef(0);
ERROR: XX000: could not find tuple for trigger 0
=# select pg_get_functiondef(0);
ERROR: XX000: cache lookup failed for function 0
=# select pg_get_viewdef(0);
 pg_get_viewdef
----------------
 Not a view
(1 row)
=# select pg_get_ruledef(0);
 pg_get_ruledef
----------------
 -
(1 row)

So yes, we could just return NULL for all of them, or make them throw an error. Anyway, my vote goes for a HEAD-only change, and just returning NULL... This way an application still has the option to fall back to something else with a CASE at SQL level without using plpgsql to catch the error.

Also, I have not monitored the code much regarding some code paths that rely on the existing rules, but I would imagine that there are things depending on the current behavior as well.
--
Michael
On Sun, Jun 5, 2016 at 4:28 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sun, Jun 5, 2016 at 12:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> This is still failing:
>>> =# select indexdef from pg_catalog.pg_indexes where indexdef is not NULL;
>>> ERROR: XX000: cache lookup failed for index 2619
>>> LOCATION: pg_get_indexdef_worker, ruleutils.c:1054
>>> Do we want to improve at least the error message?
>>
>> Maybe we should just make the function silently return NULL if the OID
>> doesn't refer to an index relation. There's precedent for that sort
>> of behavior ...
>
> For views it simply returns "Not a view", and for rules that's "-".
> All the others behave similarly to indexes:
>
> =# select pg_get_constraintdef(0);
> ERROR: XX000: cache lookup failed for constraint 0
> =# select pg_get_triggerdef(0);
> ERROR: XX000: could not find tuple for trigger 0
> =# select pg_get_functiondef(0);
> ERROR: XX000: cache lookup failed for function 0
> =# select pg_get_viewdef(0);
>  pg_get_viewdef
> ----------------
>  Not a view
> (1 row)
> =# select pg_get_ruledef(0);
>  pg_get_ruledef
> ----------------
>  -
> (1 row)
>
> So yes, we could just return NULL for all of them, or make them throw
> an error. Anyway, my vote goes for a HEAD-only change, and just
> returning NULL... This way an application still has the option to fall
> back to something else with a CASE at SQL level without using plpgsql
> to catch the error.
>
> Also, I have not monitored the code much regarding some code paths
> that rely on the existing rules, but I would imagine that there are
> things depending on the current behavior as well.

Still, on back branches I think that it would be a good idea to have a better error message for the ones that already throw an error, like "trigger with OID %u does not exist". Switching from elog to ereport would be a good idea, but wouldn't it be a problem to change what is user-visible?
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> Still, on back branches I think that it would be a good idea to have a
> better error message for the ones that already throw an error, like
> "trigger with OID %u does not exist". Switching from elog to ereport
> would be a good idea, but wouldn't it be a problem to change what is
> user-visible?

If we're going to touch this behavior in the back branches, I would vote for changing it the same as we do in HEAD. I do not think that making a user-visible behavior change in a minor release and then a different change in the next major is good strategy.

But, given the relative shortage of complaints from the field, I'd be more inclined to just do nothing in back branches. There might be people out there with code depending on the current behavior, and they'd be annoyed if we change it in a minor release.

regards, tom lane
On Sun, Jun 5, 2016 at 11:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Still, on back branches I think that it would be a good idea to have a
>> better error message for the ones that already throw an error, like
>> "trigger with OID %u does not exist". Switching from elog to ereport
>> would be a good idea, but wouldn't it be a problem to change what is
>> user-visible?
>
> If we're going to touch this behavior in the back branches, I would
> vote for changing it the same as we do in HEAD. I do not think that
> making a user-visible behavior change in a minor release and then a
> different change in the next major is good strategy.

So, I have finished with the patch attached that changes all the *def() functions to return NULL when an object does not exist. Some callers of the index and constraint definitions still expect a cache lookup error if the object does not exist, and this patch is careful about that. I think that it would be nice to get that in 9.6, but I won't fight hard for it either. So I am parking it for now in the upcoming CF.

> But, given the relative shortage of complaints from the field, I'd
> be more inclined to just do nothing in back branches. There might
> be people out there with code depending on the current behavior,
> and they'd be annoyed if we change it in a minor release.

Sure. That's the safest position. Thinking about the existing behavior for some of the index and constraint callers, even just changing the error message does not bring much, as some of them really expect to fail with a cache lookup error.
--
Michael
Attachment
On Mon, Jun 6, 2016 at 3:30 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sun, Jun 5, 2016 at 11:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> Still, on back branches I think that it would be a good idea to have a
>>> better error message for the ones that already throw an error, like
>>> "trigger with OID %u does not exist". Switching from elog to ereport
>>> would be a good idea, but wouldn't it be a problem to change what is
>>> user-visible?
>>
>> If we're going to touch this behavior in the back branches, I would
>> vote for changing it the same as we do in HEAD. I do not think that
>> making a user-visible behavior change in a minor release and then a
>> different change in the next major is good strategy.
>
> So, I have finished with the patch attached that changes all the
> *def() functions to return NULL when an object does not exist. Some
> callers of the index and constraint definitions still expect a cache
> lookup error if the object does not exist, and this patch is careful
> about that. I think that it would be nice to get that in 9.6, but I
> won't fight hard for it either. So I am parking it for now in the
> upcoming CF.
>
>> But, given the relative shortage of complaints from the field, I'd
>> be more inclined to just do nothing in back branches. There might
>> be people out there with code depending on the current behavior,
>> and they'd be annoyed if we change it in a minor release.
>
> Sure. That's the safest position. Thinking about the existing behavior
> for some of the index and constraint callers, even just changing the
> error message does not bring much, as some of them really expect to
> fail with a cache lookup error.

Committed with minor kibitzing: you don't need an "else" after a statement that transfers control out of the function.

Shouldn't pg_get_function_arguments, pg_get_function_identity_arguments, pg_get_function_result, and pg_get_function_arg_default get the same treatment?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 27, 2016 at 5:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Committed with minor kibitzing: you don't need an "else" after a
> statement that transfers control out of the function.

Thanks. Right, I forgot that.

> Shouldn't
> pg_get_function_arguments, pg_get_function_identity_arguments,
> pg_get_function_result, and pg_get_function_arg_default get the same
> treatment?

Changing all of them makes sense. Please see attached.

While looking at the series of functions pg_get_*, I have noticed as well that pg_get_userbyid() returns "unknown (OID=%u)" when it does not know a user. Perhaps we'd want to change that to NULL for consistency with the rest?
--
Michael
Attachment
On Tue, Jul 26, 2016 at 9:27 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Jul 27, 2016 at 5:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Committed with minor kibitzing: you don't need an "else" after a
>> statement that transfers control out of the function.
>
> Thanks. Right, I forgot that.
>
>> Shouldn't
>> pg_get_function_arguments, pg_get_function_identity_arguments,
>> pg_get_function_result, and pg_get_function_arg_default get the same
>> treatment?
>
> Changing all of them makes sense. Please see attached.

Committed.

> While looking at the series of functions pg_get_*, I have noticed as
> well that pg_get_userbyid() returns "unknown (OID=%u)" when it does
> not know a user. Perhaps we'd want to change that to NULL for
> consistency with the rest?

That's probably correct in theory, but it's a little less closely related, and I'm not entirely sure how far we want to go with this. Remember, the original purpose was to avoid having an internal error (cache lookup failed, XX000) exposed as a user-visible error message. Are we at risk of veering from actual bug-fixing off into useless tinkering? Not sure.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jul 26, 2016 at 9:27 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> While looking at the series of functions pg_get_*, I have noticed as
>> well that pg_get_userbyid() returns "unknown (OID=%u)" when it does
>> not know a user. Perhaps we'd want to change that to NULL for
>> consistency with the rest?

> That's probably correct in theory, but it's a little less closely
> related, and I'm not entirely sure how far we want to go with this.
> Remember, the original purpose was to avoid having an internal error
> (cache lookup failed, XX000) exposed as a user-visible error message.
> Are we at risk of veering from actual bug-fixing off into useless
> tinkering? Not sure.

I'd vote for leaving that one alone; yeah, it's a bit inconsistent now, but no one has complained about its behavior.

regards, tom lane
On Sat, Jul 30, 2016 at 1:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jul 26, 2016 at 9:27 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> While looking at the series of functions pg_get_*, I have noticed as
>>> well that pg_get_userbyid() returns "unknown (OID=%u)" when it does
>>> not know a user. Perhaps we'd want to change that to NULL for
>>> consistency with the rest?
>
>> That's probably correct in theory, but it's a little less closely
>> related, and I'm not entirely sure how far we want to go with this.
>> Remember, the original purpose was to avoid having an internal error
>> (cache lookup failed, XX000) exposed as a user-visible error message.
>> Are we at risk of veering from actual bug-fixing off into useless
>> tinkering? Not sure.
>
> I'd vote for leaving that one alone; yeah, it's a bit inconsistent
> now, but no one has complained about its behavior.

OK for me. Thanks for the commit.
--
Michael
There are many more such exposed functions, which can throw cache lookup failure error if we pass wrong value. So the actual list is as below:

record_in
domain_in
fmgr_c_validator
edb_get_objecttypeheaddef
plpgsql_validator
pg_procedure_is_visible

Are we planning to change these also? Below is one example (from sqlsmith):

ERROR: cache lookup failed for procedure 0

postgres=# select
postgres-# pg_catalog.record_in(
postgres(# cast(pg_catalog.numerictypmodout(
postgres(# cast(98 as integer)) as cstring),
postgres(# cast(1 as oid),
postgres(# cast(35 as integer));
ERROR: cache lookup failed for type 1
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> There are many more such exposed functions, which can throw cache lookup
> failure error if we pass wrong value.
>
> i.e.
> record_in
> domain_in
> fmgr_c_validator
> plpgsql_validator
> pg_procedure_is_visible
>
> Are we planning to change these also..

I think if they are SQL-callable functions that users can invoke from a SQL prompt, they shouldn't throw "cache lookup failed" errors or, for that matter, any other error that is reported with elog(). Now, I don't necessarily think that these functions should return NULL in that case the way we did with the others; that's not a sensible behavior for a type-input function AFAIK. But we should emit a better error.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> There are many more such exposed functions, which can throw cache lookup
>> failure error if we pass wrong value.
>>
>> i.e.
>> record_in
>> domain_in
>> fmgr_c_validator
>> plpgsql_validator
>> pg_procedure_is_visible
>>
>> Are we planning to change these also..

> I think if they are SQL-callable functions that users can invoke from
> a SQL prompt, they shouldn't throw "cache lookup failed" errors or,
> for that matter, any other error that is reported with elog().

FWIW, I would restrict that to "functions that users are *intended* to call from a SQL prompt". If you are fooling around calling internal functions with garbage input, you should not be surprised to get an internal error back. So of the above list, only pg_procedure_is_visible seems particularly worth changing to me. And for it, returning NULL (ie, "unknown") seems reasonable enough.

Actually, it looks to me like we already fixed that for the built-in is_visible functions:

# select pg_function_is_visible(0) is null;
 ?column?
----------
 t
(1 row)

There's no such animal as pg_procedure_is_visible. I suspect that's an EDB-ism that hasn't caught up with commit 66bb74dbe8206a35.

regards, tom lane
On Tue, Aug 2, 2016 at 7:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> There are many more such exposed functions, which can throw cache lookup
>>> failure error if we pass wrong value.
>>>
>>> i.e.
>>> record_in
>>> domain_in
>>> fmgr_c_validator
>>> plpgsql_validator
>>> pg_procedure_is_visible
>>>
>>> Are we planning to change these also..
>
>> I think if they are SQL-callable functions that users can invoke from
>> a SQL prompt, they shouldn't throw "cache lookup failed" errors or,
>> for that matter, any other error that is reported with elog().
>
> FWIW, I would restrict that to "functions that users are *intended* to
> call from a SQL prompt". If you are fooling around calling internal
> functions with garbage input, you should not be surprised to get an
> internal error back. So of the above list, only pg_procedure_is_visible
> seems particularly worth changing to me. And for it, returning NULL
> (ie, "unknown") seems reasonable enough.

I think that's just making life difficult. If nothing else, sqlsmith hunts around for functions it can call that return internal errors, and if we refuse to fix all of them to return user-facing errors, then it's just crap for the people running sqlsmith to sift through and it's a judgment call whether to fix each particular case. Even aside from that, I think it's much better to have a clear and unambiguous rule that elog is only for can't-happen things, not we-don't-recommend-it things.

> There's no such animal as pg_procedure_is_visible. I suspect that's an
> EDB-ism that hasn't caught up with commit 66bb74dbe8206a35.

Yep.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 2, 2016 at 5:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that's just making life difficult. If nothing else, sqlsmith
> hunts around for functions it can call that return internal errors,
> and if we refuse to fix all of them to return user-facing errors, then
> it's just crap for the people running sqlsmith to sift through and
> it's a judgment call whether to fix each particular case. Even aside
> from that, I think it's much better to have a clear and unambiguous
> rule that elog is only for can't-happen things, not
> we-don't-recommend-it things.

+1. This also has value in the context of automatically surfacing situations where "can't happen" errors do in fact happen at scale.
--
Peter Geoghegan
On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that's just making life difficult. If nothing else, sqlsmith
> hunts around for functions it can call that return internal errors,
> and if we refuse to fix all of them to return user-facing errors, then
> it's just crap for the people running sqlsmith to sift through and
> it's a judgment call whether to fix each particular case. Even aside
> from that, I think it's much better to have a clear and unambiguous
> rule that elog is only for can't-happen things, not
> we-don't-recommend-it things.

I have changed all these functions to report a more appropriate error with ereport.

I used the ERRCODE_WRONG_OBJECT_TYPE error code for reporting such errors. I think this is the closest error code among all existing error codes; other options can be (ERRCODE_WRONG_OBJECT_TYPE).
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Aug 8, 2016 at 4:58 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think that's just making life difficult. If nothing else, sqlsmith
>> hunts around for functions it can call that return internal errors,
>> and if we refuse to fix all of them to return user-facing errors, then
>> it's just crap for the people running sqlsmith to sift through and
>> it's a judgment call whether to fix each particular case. Even aside
>> from that, I think it's much better to have a clear and unambiguous
>> rule that elog is only for can't-happen things, not
>> we-don't-recommend-it things.
>
> I have changed all these functions to report a more appropriate error
> with ereport.
>
> I used the ERRCODE_WRONG_OBJECT_TYPE error code for reporting such errors.
> I think this is the closest error code among all existing error codes;
> other options can be (ERRCODE_WRONG_OBJECT_TYPE).

Your other options and the option you choose are the same.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 8, 2016 at 5:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Aug 8, 2016 at 4:58 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Wed, Aug 3, 2016 at 5:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think that's just making life difficult. If nothing else, sqlsmith
>>> hunts around for functions it can call that return internal errors,
>>> and if we refuse to fix all of them to return user-facing errors, then
>>> it's just crap for the people running sqlsmith to sift through and
>>> it's a judgment call whether to fix each particular case. Even aside
>>> from that, I think it's much better to have a clear and unambiguous
>>> rule that elog is only for can't-happen things, not
>>> we-don't-recommend-it things.
>>
>> I have changed all these functions to report a more appropriate error
>> with ereport.
>>
>> I used the ERRCODE_WRONG_OBJECT_TYPE error code for reporting such errors.
>> I think this is the closest error code among all existing error codes;
>> other options can be (ERRCODE_WRONG_OBJECT_TYPE).
>
> Your other options and the option you choose are the same.

Did you consider using ERRCODE_UNDEFINED_COLUMN with error messages like "type %u does not exist" or "type id %u does not exist"? The errcode ERRCODE_UNDEFINED_COLUMN seems to be used for syscache lookup failure cases at certain places in the code.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 8, 2016 at 5:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Your other options and the option you choose are the same.
>
> Did you consider using ERRCODE_UNDEFINED_COLUMN with error messages
> like "type %u does not exist" or "type id %u does not exist"? The errcode
> ERRCODE_UNDEFINED_COLUMN seems to be used for syscache lookup failure
> cases at certain places in the code.

But I think OBJECT will be more appropriate than COLUMN at this place. However, I can change the error message to "type id %u does not exist" if this seems better?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 8, 2016 at 6:00 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Aug 8, 2016 at 5:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Did you consider using ERRCODE_UNDEFINED_COLUMN with error messages
>> like "type %u does not exist" or "type id %u does not exist"? The errcode
>> ERRCODE_UNDEFINED_COLUMN seems to be used for syscache lookup failure
>> cases at certain places in the code.
>
> But I think OBJECT will be more appropriate than COLUMN at this place.

Okay, then how about ERRCODE_UNDEFINED_OBJECT? It seems to be used in almost similar cases.

> However, I can change the error message to "type id %u does not exist"
> if this seems better?

I think that is better.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 9, 2016 at 6:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, then how about ERRCODE_UNDEFINED_OBJECT? It seems to be used in
> almost similar cases.

This seems better. After checking other places, I found that for an invalid type we are using ERRCODE_UNDEFINED_OBJECT and for invalid functions we are using ERRCODE_UNDEFINED_FUNCTION. So I have done it the same way.

Updated patch attached.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Aug 10, 2016 at 10:04 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> This seems better. After checking other places, I found that for an
> invalid type we are using ERRCODE_UNDEFINED_OBJECT and for invalid
> functions we are using ERRCODE_UNDEFINED_FUNCTION. So I have done it
> the same way.
>
> Updated patch attached.

I found some more places where we can get a similar error and updated them in my v3 patch. I don't claim that it fixes all errors of this kind, but at least it fixes the errors reported by sqlsmith plus some additional ones I found in the same area.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Aug 10, 2016 at 11:27 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Aug 10, 2016 at 10:04 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> This seems better. After checking other places, I found that for an
>> invalid type we are using ERRCODE_UNDEFINED_OBJECT and for invalid
>> functions we are using ERRCODE_UNDEFINED_FUNCTION. So I have done it
>> the same way.
>>
>> Updated patch attached.
>
> I found some more places where we can get a similar error and updated
> them in my v3 patch.

@@ -412,7 +412,9 @@ plpgsql_validator(PG_FUNCTION_ARGS)
 	/* Get the new function's pg_proc entry */
 	tuple = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcoid));
 	if (!HeapTupleIsValid(tuple))
-		elog(ERROR, "cache lookup failed for function %u", funcoid);
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_FUNCTION),
+				 errmsg("function with OID %u does not exist", funcoid)));

If you are making changes in plpgsql_validator(), then shouldn't we make changes in plperl_validator() or plpython_validator()? I see that you have made changes to function CheckFunctionValidatorAccess() which seems to be getting called from *_validator() (* indicates plpgsql/plperl/plpython) functions. Is there a reason for changing the *_validator() function?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 17, 2016 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Aug 10, 2016 at 11:27 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> On Wed, Aug 10, 2016 at 10:04 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> This seems better. After checking other places, I found that for an
>>> invalid type we are using ERRCODE_UNDEFINED_OBJECT and for invalid
>>> functions we are using ERRCODE_UNDEFINED_FUNCTION. So I have done it
>>> the same way.
>>>
>>> Updated patch attached.
>>
>> I found some more places where we can get a similar error and updated
>> them in my v3 patch.
>
> @@ -412,7 +412,9 @@ plpgsql_validator(PG_FUNCTION_ARGS)
>  	/* Get the new function's pg_proc entry */
>  	tuple = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcoid));
>  	if (!HeapTupleIsValid(tuple))
> -		elog(ERROR, "cache lookup failed for function %u", funcoid);
> +		ereport(ERROR,
> +				(errcode(ERRCODE_UNDEFINED_FUNCTION),
> +				 errmsg("function with OID %u does not exist", funcoid)));
>
> If you are making changes in plpgsql_validator(), then shouldn't we
> make changes in plperl_validator() or plpython_validator()? I see
> that you have made changes to function CheckFunctionValidatorAccess()
> which seems to be getting called from *_validator() (* indicates
> plpgsql/plperl/plpython) functions. Is there a reason for changing
> the *_validator() function?

Yeah, when I glanced briefly at this patch, I found myself wondering whether all of these call sites were actually reachable (and why the patch contained no test cases to prove it).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 17, 2016 at 6:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> If you are making changes in plpgsql_validator(), then shouldn't we
>> make changes in plperl_validator() or plpython_validator()? I see
>> that you have made changes to function CheckFunctionValidatorAccess()
>> which seems to be getting called from *_validator() (* indicates
>> plpgsql/plperl/plpython) functions. Is there a reason for changing
>> the *_validator() function?

Yes, you are right: making changes in CheckFunctionValidatorAccess is sufficient; there is no need to make changes in the *_validator (* indicates plpgsql/plperl/plpython) functions. I have changed this in the attached patch.

> Yeah, when I glanced briefly at this patch, I found myself wondering
> whether all of these call sites were actually reachable (and why the
> patch contained no test cases to prove it).

I have also added test cases to cover all my changes.

Basically, this patch changes the error at three places:

1. getBaseTypeAndTypmod: This is being called from the domain_in exposed function (domain_in -> domain_state_setup -> getBaseTypeAndTypmod). This function is also called from many other places which are not exposed functions, but I don't think this will have any impact, will it?

2. lookup_type_cache: This is being called from record_in (record_in -> lookup_rowtype_tupdesc -> lookup_rowtype_tupdesc_internal -> lookup_type_cache).

3. CheckFunctionValidatorAccess: This is being called from all language validator functions.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 7, 2016 at 8:52 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I reviewed and tested the patch.  The changes are fine.
> This patch provides better error messages than before.
>
> Marked the patch as "Ready for committer" in the commitfest.

Thanks for the review!

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Dilip Kumar <dilipbalaut@gmail.com> writes:
> Basically this patch changes the error in three places:

> 1. getBaseTypeAndTypmod: this is reached from the exposed function
> domain_in (domain_in -> domain_state_setup -> getBaseTypeAndTypmod).
> The function is also called from many places that are not exposed
> functions, but I don't think that will have any impact; will it?

I really don't like this solution.  Changing this one function, out of the
dozens just like it in lsyscache.c, seems utterly random and arbitrary.

I'm actually not especially convinced that passing domain_in a random
OID needs to not produce an XX000 error.  That isn't a user-facing
function and people shouldn't be calling it by hand at all.  The same
goes for record_in.  But if we do want those cases to avoid this,
I think it's appropriate to fix it in those functions, not decree
that essentially-random other parts of the system should bear the
responsibility.  (Thought experiment: if we changed the order of
operations in domain_in so that getBaseTypeAndTypmod were no longer
the first place to fail, would we undo this change and then change
wherever the failure had moved to?  That's pretty messy.)

In the case of record_in, it seems like it'd be easy enough to use
lookup_rowtype_tupdesc_noerror() instead of lookup_rowtype_tupdesc()
and then throw a user-facing error right there.  Not sure about a
nice solution for domain_in.  We might have to resort to an extra
syscache lookup there, but even if we did, it should happen only
once per query so that's not very awful.

> 3. CheckFunctionValidatorAccess: this is called from all the language
> validator functions.

This part seems reasonable, since the validator functions are documented
as something users might call, and CheckFunctionValidatorAccess seems
like an apropos place to handle it.

			regards, tom lane
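A sketch of that suggestion as applied to record_in(), assuming its existing tupType/tupTypmod variables; the message wording is invented, and (as the reply below notes) the underlying lookup_type_cache() call would still throw the internal error first, so this alone would not be enough:

    /* Use the no-error tupdesc lookup and report a user-facing error here. */
    tupdesc = lookup_rowtype_tupdesc_noerror(tupType, tupTypmod, true);
    if (tupdesc == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_OBJECT),
                 errmsg("record type with OID %u does not exist", tupType)));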
On Thu, Sep 8, 2016 at 2:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I really don't like this solution.  Changing this one function, out of the
> dozens just like it in lsyscache.c, seems utterly random and arbitrary.
>
> I'm actually not especially convinced that passing domain_in a random
> OID needs to not produce an XX000 error.  That isn't a user-facing
> function and people shouldn't be calling it by hand at all.  The same
> goes for record_in.  But if we do want those cases to avoid this,
> I think it's appropriate to fix it in those functions, not decree
> that essentially-random other parts of the system should bear the
> responsibility.  (Thought experiment: if we changed the order of
> operations in domain_in so that getBaseTypeAndTypmod were no longer
> the first place to fail, would we undo this change and then change
> wherever the failure had moved to?  That's pretty messy.)
>
> In the case of record_in, it seems like it'd be easy enough to use
> lookup_rowtype_tupdesc_noerror() instead of lookup_rowtype_tupdesc()
> and then throw a user-facing error right there.

Actually, when we pass an invalid type to record_in, the error is
thrown in lookup_type_cache(), and that function does not take a
"no_error" input (that is limited to lookup_rowtype_tupdesc_internal).

One option is to add a "no_error" parameter to lookup_type_cache() as
well, and throw the error only in record_in.  But the problem is that
lookup_type_cache() has a lot of callers, and would it be a good idea
to throw one generic error instead of multiple specific errors?

Another option is to do for record_in the same thing you have suggested
for domain_in (an extra syscache lookup)?

> Not sure about a
> nice solution for domain_in.  We might have to resort to an extra
> syscache lookup there, but even if we did, it should happen only
> once per query so that's not very awful.
>
>> 3. CheckFunctionValidatorAccess: this is called from all the language
>> validator functions.
>
> This part seems reasonable, since the validator functions are documented
> as something users might call, and CheckFunctionValidatorAccess seems
> like an apropos place to handle it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
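For illustration, a sketch of the flag-based option mentioned here; TYPECACHE_NOERROR is an invented name with no counterpart in the tree, and the next message sets this approach aside in favor of promoting the elog to an ereport:

    /* Hypothetical: inside lookup_type_cache(), while loading a new entry. */
    tp = SearchSysCache1(TYPEOID, ObjectIdGetDatum(type_id));
    if (!HeapTupleIsValid(tp))
    {
        if (flags & TYPECACHE_NOERROR)      /* invented flag bit */
            return NULL;                    /* caller reports its own error */
        elog(ERROR, "cache lookup failed for type %u", type_id);
    }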
Dilip Kumar <dilipbalaut@gmail.com> writes:
> On Thu, Sep 8, 2016 at 2:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In the case of record_in, it seems like it'd be easy enough to use
>> lookup_rowtype_tupdesc_noerror() instead of lookup_rowtype_tupdesc()
>> and then throw a user-facing error right there.

> Actually, when we pass an invalid type to record_in, the error is
> thrown in lookup_type_cache(), and that function does not take a
> "no_error" input (that is limited to lookup_rowtype_tupdesc_internal).

Ah, should have dug a bit deeper.

> One option is to add a "no_error" parameter to lookup_type_cache() as
> well, and throw the error only in record_in.  But the problem is that
> lookup_type_cache() has a lot of callers, and would it be a good idea
> to throw one generic error instead of multiple specific errors?

We could make it work like that without breaking the ABI if we were
to add a NOERROR bit to the allowed "flags".  However, after looking
around a bit I'm no longer convinced what I said above is a good idea.
In particular, if we put the responsibility of reporting the error on
callers then we'll lose the ability to distinguish "no such pg_type OID"
from "type exists but it's only a shell".  So I'm now thinking it's okay
to promote lookup_type_cache's elog to a full ereport, especially since
it's already using ereport for some of the related error cases such as
the shell-type failure.

That still leaves us with what to do for domain_in.  A really simple
fix would be to move the InitDomainConstraintRef call before the
getBaseTypeAndTypmod call, because the former fetches the typcache
entry and would then benefit from lookup_type_cache's ereport.
But that seems a bit magic.

I'm tempted to propose that domain_state_setup start with

    typentry = lookup_type_cache(domainType, TYPECACHE_DOMAIN_INFO);
    if (typentry->typtype != TYPTYPE_DOMAIN)
        ereport(ERROR,
                (errcode(ERRCODE_DATATYPE_MISMATCH),
                 errmsg("type %s is not a domain",
                        format_type_be(domainType))));

removing the need to verify that after getBaseTypeAndTypmod.

The cache loading is basically free because InitDomainConstraintRef
would do it anyway; so the extra cost here is only one dynahash search.
You could imagine buying back those cycles by teaching the typcache
to be able to cache the result of getBaseTypeAndTypmod, but I'm doubtful
that we really care.  This whole setup sequence only happens once per
query anyway.

			regards, tom lane
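A sketch of the elog-to-ereport promotion in lookup_type_cache() being proposed here; the SQLSTATE and message wording are assumptions rather than the committed text:

    /* Report a regular undefined-object error when the pg_type row is missing. */
    tp = SearchSysCache1(TYPEOID, ObjectIdGetDatum(type_id));
    if (!HeapTupleIsValid(tp))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_OBJECT),
                 errmsg("type with OID %u does not exist", type_id)));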
On Thu, Sep 8, 2016 at 7:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We could make it work like that without breaking the ABI if we were
> to add a NOERROR bit to the allowed "flags".  However, after looking
> around a bit I'm no longer convinced what I said above is a good idea.
> In particular, if we put the responsibility of reporting the error on
> callers then we'll lose the ability to distinguish "no such pg_type OID"
> from "type exists but it's only a shell".  So I'm now thinking it's okay
> to promote lookup_type_cache's elog to a full ereport, especially since
> it's already using ereport for some of the related error cases such as
> the shell-type failure.

Okay.

> That still leaves us with what to do for domain_in.  A really simple
> fix would be to move the InitDomainConstraintRef call before the
> getBaseTypeAndTypmod call, because the former fetches the typcache
> entry and would then benefit from lookup_type_cache's ereport.
> But that seems a bit magic.
>
> I'm tempted to propose that domain_state_setup start with
>
>     typentry = lookup_type_cache(domainType, TYPECACHE_DOMAIN_INFO);
>     if (typentry->typtype != TYPTYPE_DOMAIN)
>         ereport(ERROR,
>                 (errcode(ERRCODE_DATATYPE_MISMATCH),
>                  errmsg("type %s is not a domain",
>                         format_type_be(domainType))));
>
> removing the need to verify that after getBaseTypeAndTypmod.

That seems like a better option; I have done it this way and am
attaching the modified patch.

> The cache loading is basically free because InitDomainConstraintRef
> would do it anyway; so the extra cost here is only one dynahash search.
> You could imagine buying back those cycles by teaching the typcache
> to be able to cache the result of getBaseTypeAndTypmod, but I'm doubtful
> that we really care.  This whole setup sequence only happens once per
> query anyway.

Agreed.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Dilip Kumar <dilipbalaut@gmail.com> writes:
> That seems like a better option; I have done it this way and am
> attaching the modified patch.

Pushed with cosmetic adjustments --- mostly, I thought we needed some
comments about the topic.

			regards, tom lane
On Fri, Sep 9, 2016 at 6:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pushed with cosmetic adjustments --- mostly, I thought we needed some
> comments about the topic.

Okay, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com