Thread: Explicit deterministic COLLATE fails with pattern matching operationson column with non-deterministic collation
Explicit deterministic COLLATE fails with pattern matching operationson column with non-deterministic collation
From
James Lucas
Date:
Hi all, Wanted to call out what seems like a possible bug in non-deterministic collation handling with pattern matching operators. Per the documentation, non-deterministic collations are not supported with pattern matching operators. Section 9.7 of the PG12 manual recommends "The pattern matching operators of all three kinds do not support nondeterministic collations. If required, apply a different collation to the expression to work around this limitation." However, I'm finding that pattern matching operations fail when a column is declared with a non-deterministic collation, *even if* a different, deterministic collation is explicitly applied to the pattern matching operation. This doesn't seem to be the expected behavior. Example. This is tested on Postgres 12.3, on Centos 8.1.1911 with libicu 60.3. Create a non-deterministic collation. create collation mycollation (provider = icu, locale = 'en-US-ks-level2.utf8', deterministic = false); Create a couple of sample tables: create table ctest (id numeric, t text); create table ctestnd (id numeric, t text collate mycollation); Populate them with some data: insert into ctest values (1,'aAa'); insert into ctest select generate_series(2,100000),'bbb'; insert into ctestnd select id, t from ctest; analyze ctest, ctestnd; Add a few indexes: create index ctest_idx01 on ctest (t); create index ctest_idx02 on ctest (t collate "C"); create index ctestnd_idx01 on ctestnd (t); create index ctestnd_idx02 on ctestnd (t collate "C"); Test on ctest: explain select * from ctest where t = 'aAa' collate "C"; QUERY PLAN -------------------------------------------------------------------------- Index Scan using ctest_idx02 on ctest (cost=0.42..4.44 rows=1 width=10) Index Cond: (t = 'aAa'::text COLLATE "C") COMMENT: Works as expected. explain select * from ctest where t like 'a%'; QUERY PLAN -------------------------------------------------------------------------- Index Scan using ctest_idx02 on ctest (cost=0.42..8.44 rows=1 width=10) Index Cond: ((t >= 'a'::text) AND (t < 'b'::text)) Filter: (t ~~ 'a%'::text) COMMENT: Actually this is very interesting, because even without an explicit COLLATE clause, LIKE still uses the "C" collation index. Not sure if that's intended behavior either? explain select * from ctest where t like 'a%' collate "C"; QUERY PLAN -------------------------------------------------------------------------- Index Scan using ctest_idx02 on ctest (cost=0.42..8.44 rows=1 width=10) Index Cond: ((t >= 'a'::text) AND (t < 'b'::text)) Filter: (t ~~ 'a%'::text COLLATE "C") COMMENT: Uses explicit collation and index as expected. Test on ctestnd: explain select * from ctestnd where t = 'aAa' collate "C"; QUERY PLAN ------------------------------------------------------------------------------ Index Scan using ctestnd_idx02 on ctestnd (cost=0.42..4.44 rows=1 width=10) Index Cond: (t = 'aAa'::text COLLATE "C") COMMENT: Works as expected. explain select * from ctestnd where t like 'a%'; ERROR: nondeterministic collations are not supported for LIKE COMMENT: Fails as expected. explain select * from ctestnd where t like 'a%' collate "C"; ERROR: nondeterministic collations are not supported for LIKE COMMENT: Not expected. It seems like the explicit COLLATE clause is ignored in this case. I've tried different placements for the COLLATE clause, and none seem to work. Is this a bug, or have I missed something? Thanks, James Lucas
Re: Explicit deterministic COLLATE fails with pattern matchingoperations on column with non-deterministic collation
From
"David G. Johnston"
Date:
On Wed, May 27, 2020 at 8:23 AM James Lucas <jlucasdba@gmail.com> wrote:
create table ctestnd (id numeric, t text collate mycollation);
create index ctestnd_idx02 on ctestnd (t collate "C");
Test on ctestnd:
explain select * from ctestnd where t = 'aAa' collate "C";
QUERY PLAN
------------------------------------------------------------------------------
Index Scan using ctestnd_idx02 on ctestnd (cost=0.42..4.44 rows=1 width=10)
Index Cond: (t = 'aAa'::text COLLATE "C")
COMMENT: Works as expected.
Uses an index scan which is where the deterministic collation exists
explain select * from ctestnd where t like 'a%';
ERROR: nondeterministic collations are not supported for LIKE
COMMENT: Fails as expected.
explain select * from ctestnd where t like 'a%' collate "C";
ERROR: nondeterministic collations are not supported for LIKE
Your schema is inherently unstable in this respect because the planner has to be allowed to choose a sequential scan and as soon as it does it attempts to perform like comparisons with table data that is stored using a non-deterministic collation.
I don't know what kinds of promises we make about implicit collation manipulation here but absent such a transformation the sequential scan plan with LIKE generates an invalid plan choice. That it doesn't go find the index that happens to have a workable collation for the query is unsurprising - whether that is even a possibility is beyond me.
David J.
Re: Explicit deterministic COLLATE fails with pattern matchingoperations on column with non-deterministic collation
From
James Lucas
Date:
Hi David, Thanks for the response. One possibly relevant thing I forgot to mention. The collation for the database is "en_US.UTF-8", which is thus also the collation for the t column of ctest. Per the documentation, it seems putting an implicit collation on the operation should work. Although the documentation is admittedly a little vague in this respect. I also found a mail thread in the list where Peter Eisentraut recommended syntax exactly like this (collate "C") to work around the inability to use pattern matching on non-deterministic collation columns. Unfortunately that thread trailed out without a response if it actually worked. Noticed something else a bit interesting. Perhaps removing indexes from the equation would also help: drop index ctestnd_idx01, ctestnd_idx02, ctest_idx01, ctest_idx02; explain select * from ctest where t like 'a%' collate "C"; QUERY PLAN --------------------------------------------------------- Seq Scan on ctest (cost=0.00..1791.00 rows=1 width=10) Filter: (t ~~ 'a%'::text COLLATE "C") COMMENT: Okay explain select * from ctest where t like 'a%' collate mycollation; QUERY PLAN --------------------------------------------------------- Seq Scan on ctest (cost=0.00..1791.00 rows=1 width=10) Filter: (t ~~ 'a%'::text COLLATE mycollation) COMMENT: Wait, that doesn't seem right. select * from ctest where t like 'a%' collate mycollation; ERROR: nondeterministic collations are not supported for LIKE COMMENT: So in this case, specifying an explicit non-deterministic collation with EXPLAIN, we get a plan. But when we actually go to execute, it fails. explain select * from ctestnd where t like 'a%' collate "C"; ERROR: nondeterministic collations are not supported for LIKE COMMENT: But in the inverse case, running explain on a column with a non-deterministic collation, but an explicit deterministic collation, we don't even get a plan with EXPLAIN. That seems inconsistent. Only conclusion I can reach is that it's failing a check at an earlier point in the process than in the other case. Thanks, James On Wed, May 27, 2020 at 10:53 AM David G. Johnston <david.g.johnston@gmail.com> wrote: > > On Wed, May 27, 2020 at 8:23 AM James Lucas <jlucasdba@gmail.com> wrote: >> >> >> create table ctestnd (id numeric, t text collate mycollation); >> >> create index ctestnd_idx02 on ctestnd (t collate "C"); > > >> >> Test on ctestnd: >> explain select * from ctestnd where t = 'aAa' collate "C"; >> QUERY PLAN >> ------------------------------------------------------------------------------ >> Index Scan using ctestnd_idx02 on ctestnd (cost=0.42..4.44 rows=1 width=10) >> Index Cond: (t = 'aAa'::text COLLATE "C") >> COMMENT: Works as expected. > > > Uses an index scan which is where the deterministic collation exists > >> >> >> explain select * from ctestnd where t like 'a%'; >> ERROR: nondeterministic collations are not supported for LIKE >> COMMENT: Fails as expected. >> >> explain select * from ctestnd where t like 'a%' collate "C"; >> ERROR: nondeterministic collations are not supported for LIKE >> > > Your schema is inherently unstable in this respect because the planner has to be allowed to choose a sequential scan andas soon as it does it attempts to perform like comparisons with table data that is stored using a non-deterministic collation. > > I don't know what kinds of promises we make about implicit collation manipulation here but absent such a transformationthe sequential scan plan with LIKE generates an invalid plan choice. That it doesn't go find the index thathappens to have a workable collation for the query is unsurprising - whether that is even a possibility is beyond me. > > David J. >
Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation
From
Tom Lane
Date:
James Lucas <jlucasdba@gmail.com> writes: > explain select * from ctestnd where t like 'a%' collate "C"; > ERROR: nondeterministic collations are not supported for LIKE Yeah. I traced through this, and the place where it's failing is where the planner tries to apply the LIKE operator to the stored MCV values (to see how many of them pass the condition, which gives us a big clue about the selectivity). Unfortunately, per the comments in selfuncs.c, * For both oprrest and oprjoin functions, the operator's input collation OID * (if any) is passed using the standard fmgr mechanism, so that the estimator * function can fetch it with PG_GET_COLLATION(). Note, however, that all * statistics in pg_statistic are currently built using the relevant column's * collation. Thus, in most cases where we are looking at statistics, we * should ignore the operator collation and use the stats entry's collation. * We expect that the error induced by doing this is usually not large enough * to justify complicating matters. In any case, doing otherwise would yield * entirely garbage results for ordered stats data such as histograms. mcv_selectivity is following this advice and applying LIKE with the ctestnd.t column's declared collation ... and then the operator throws an error. The idea that using the "wrong" collation might actually cause an error was not factored into this design, obviously. I'm not sure offhand what to do about it. If we go over to using the query's collation then we avoid that issue, but instead we have the problem noted in this comment about the histogram sort order not matching what the operator expects. (In the case of mcv_selectivity the sort order isn't really an issue, but it is an issue for sibling functions such as ineq_histogram_selectivity.) This issue only dates back to commit 5e0928005; before that, we just blindly passed DEFAULT_COLLATION_OID to operators being evaluated for estimation purposes. (I suppose if you made the database's default collation nondeterministic, you could still get into trouble; but that case may not be reachable right now.) On the other hand, the actual breakage is even newer, because nondeterministic collations weren't added until 5e1963fb7, several months later. Both of those are v12 cycle, so it's academic from a user's standpoint which one we blame; but the upshot is that this case doesn't work. Ideally, no operator would ever throw an error about unsupported collations, but I suppose that day is far away. I guess the path of least resistance is to change the selectivity functions to use the query's collation; then, if you get an error here you would have done so at runtime anyway. The problem of inconsistency with the histogram collation will be real for ineq_histogram_selectivity; but we had a variant of that before, in that always using DEFAULT_COLLATION_OID would give answers that were wrong for a query using a different collation. Peter, any other thoughts? regards, tom lane
Re: Explicit deterministic COLLATE fails with pattern matchingoperations on column with non-deterministic collation
From
James Lucas
Date:
Thanks Tom, This is too much into the guts of the planner for me to contribute much. This does raise an interesting point that I had not considered though - it sounds like column statistics depend on the default collation on the column. That could impact the plans chosen for future queries, even if those queries are performed using a different collation. For most deterministic collations I expect that probably doesn't make much of a difference, but for non-deterministic collations it seems like the difference in stats could be significant. N_distinct, in particular, seems like it might be very different for a non-deterministic collation. I tried setting up a pathological test case for this, and it seems like at least currently, even with a non-deterministic collation statistics still count values as distinct, even if the default collation would consider them equivalent. Not sure if that's as intended or not? create table stest (id numeric, t text); create table stestnd (id numeric, t text collate mycollation); insert into stest select generate_series(1,50000),'aaa'; insert into stest select generate_series(50001,100000),'aAa'; insert into stest select generate_series(100001,150000),'bbb'; insert into stest select generate_series(150001,200000),'bBb'; insert into stest select generate_series(200001,250000),'ccc'; insert into stest select generate_series(250001,300000),'cCc'; insert into stestnd select * from stest; analyze stest, stestnd; select schemaname, tablename, attname, n_distinct, most_common_vals from pg_stats where attname='t' and tablename like 'stest%' order by tablename; schemaname | tablename | attname | n_distinct | most_common_vals ------------+-----------+---------+------------+--------------------------- public | stest | t | 6 | {aAa,cCc,bbb,aaa,ccc,bBb} public | stestnd | t | 6 | {bBb,ccc,bbb,aAa,cCc,aaa} Actually it turns out the DISTINCT clause doesn't either: select count(*) from (select distinct t from stest) s; count ------- 6 select count(*) from (select distinct t from stestnd) s; count ------- 6 Sorry - don't want to derail the question at hand too much. It seems like it might be relevant if the discussion is around stats and collation handling. Thanks, James On Wed, May 27, 2020 at 7:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > James Lucas <jlucasdba@gmail.com> writes: > > explain select * from ctestnd where t like 'a%' collate "C"; > > ERROR: nondeterministic collations are not supported for LIKE > > Yeah. I traced through this, and the place where it's failing is where > the planner tries to apply the LIKE operator to the stored MCV values > (to see how many of them pass the condition, which gives us a big clue > about the selectivity). Unfortunately, per the comments in selfuncs.c, > > * For both oprrest and oprjoin functions, the operator's input collation OID > * (if any) is passed using the standard fmgr mechanism, so that the estimator > * function can fetch it with PG_GET_COLLATION(). Note, however, that all > * statistics in pg_statistic are currently built using the relevant column's > * collation. Thus, in most cases where we are looking at statistics, we > * should ignore the operator collation and use the stats entry's collation. > * We expect that the error induced by doing this is usually not large enough > * to justify complicating matters. In any case, doing otherwise would yield > * entirely garbage results for ordered stats data such as histograms. > > mcv_selectivity is following this advice and applying LIKE with the > ctestnd.t column's declared collation ... and then the operator throws > an error. > > The idea that using the "wrong" collation might actually cause an error > was not factored into this design, obviously. I'm not sure offhand what > to do about it. If we go over to using the query's collation then we > avoid that issue, but instead we have the problem noted in this comment > about the histogram sort order not matching what the operator expects. > (In the case of mcv_selectivity the sort order isn't really an issue, > but it is an issue for sibling functions such as > ineq_histogram_selectivity.) > > This issue only dates back to commit 5e0928005; before that, we just > blindly passed DEFAULT_COLLATION_OID to operators being evaluated for > estimation purposes. (I suppose if you made the database's default > collation nondeterministic, you could still get into trouble; but that > case may not be reachable right now.) On the other hand, the actual > breakage is even newer, because nondeterministic collations weren't > added until 5e1963fb7, several months later. Both of those are v12 > cycle, so it's academic from a user's standpoint which one we blame; > but the upshot is that this case doesn't work. > > Ideally, no operator would ever throw an error about unsupported > collations, but I suppose that day is far away. > > I guess the path of least resistance is to change the selectivity > functions to use the query's collation; then, if you get an error > here you would have done so at runtime anyway. The problem of > inconsistency with the histogram collation will be real for > ineq_histogram_selectivity; but we had a variant of that before, > in that always using DEFAULT_COLLATION_OID would give answers > that were wrong for a query using a different collation. > > Peter, any other thoughts? > > regards, tom lane
Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation
From
Tom Lane
Date:
James Lucas <jlucasdba@gmail.com> writes: > I tried setting up a pathological test case for this, and it seems > like at least currently, even with a non-deterministic collation > statistics still count values as distinct, even if the default > collation would consider them equivalent. Not sure if that's as > intended or not? I experimented with this, and what I'm seeing is that ucol_strcollUTF8() reports that 'aaa' is different from 'aAa'. So the behavior on the Postgres side is as-expected. I suspect that the 'en-US-ks-level2' ICU locale doesn't act as you think it does. (That is, just saying that a collation is nondeterministic doesn't make it so; it only forces Postgres through slower code paths that allow for the possibility of bitwise-unequal strings being reported as equal by ICU.) Not knowing anything about ICU, I can't say more than that. [ Tested on libicu-60.3-2.el8_1 ] regards, tom lane
Re: Explicit deterministic COLLATE fails with pattern matchingoperations on column with non-deterministic collation
From
"Daniel Verite"
Date:
Tom Lane wrote: > I suspect that the 'en-US-ks-level2' ICU locale doesn't act as you > think it does. Indeed, because the syntax is tricky. The OP wants 'en-US-u-ks-level2'. With 'en-US-ks-level2', the ks-level2 component is ignored and you get a tertiary colstrength. Or use 'en-US@colStrength=secondary' which is possibly more readable and works with older versions of ICU. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Re: Explicit deterministic COLLATE fails with pattern matchingoperations on column with non-deterministic collation
From
James Lucas
Date:
Apologies if anyone gets this twice. I got a rejected mail notice back the first time I sent. You are correct. I was playing around with collation naming sometime back and when I started looking at this, I just used one I had left in the database assuming it was correct. That's my bad. I dropped the tables and redefined the collation as create collation mycollation (provider = icu, locale = 'en-US-u-ks-level2', deterministic = false); Now the results are more what I expected. select schemaname, tablename, attname, n_distinct, most_common_vals from pg_stats where attname='t' and tablename like 'stest%' order by tablename; schemaname | tablename | attname | n_distinct | most_common_vals ------------+-----------+---------+------------+--------------------------- public | stest | t | 6 | {aaa,cCc,bBb,bbb,ccc,aAa} public | stestnd | t | 3 | {ccc,bbb,aaa} So that is something to be aware of - the collation defined on the column can impact stats values, which could in turn impact plans chosen for queries that use alternative collations. Sorry for the distraction. That still leaves us with the original issue regarding LIKE and COLLATE. Thanks, James On Thu, May 28, 2020 at 1:48 PM Daniel Verite <daniel@manitou-mail.org> wrote: > > Tom Lane wrote: > > > I suspect that the 'en-US-ks-level2' ICU locale doesn't act as you > > think it does. > > Indeed, because the syntax is tricky. The OP wants 'en-US-u-ks-level2'. > With 'en-US-ks-level2', the ks-level2 component is ignored and you > get a tertiary colstrength. > > Or use 'en-US@colStrength=secondary' which is possibly more > readable and works with older versions of ICU. > > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite
Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation
From
Tom Lane
Date:
James Lucas <jlucasdba@gmail.com> writes: > So that is something to be aware of - the collation defined on the > column can impact stats values, which could in turn impact plans > chosen for queries that use alternative collations. Yeah. At some point we might try to collect stats with respect to multiple collations, but that's a long way off probably. (I have suggested that CREATE STATISTICS could be extended to control this type of thing, but I don't think anyone's worked on making it happen.) regards, tom lane
Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation
From
Tom Lane
Date:
I wrote: > I guess the path of least resistance is to change the selectivity > functions to use the query's collation; then, if you get an error > here you would have done so at runtime anyway. The problem of > inconsistency with the histogram collation will be real for > ineq_histogram_selectivity; but we had a variant of that before, > in that always using DEFAULT_COLLATION_OID would give answers > that were wrong for a query using a different collation. I worked on this for awhile and came up with the attached patchset. 0001 does about the minimum required to avoid this failure, by passing the query's collation not stacoll to operators and selectivity functions invoked during selectivity estimation. Unfortunately, it doesn't seem like we could sanely back-patch this, because it requires adding parameters to several globally-visible functions. The odds that some external code is calling those functions seem too high to risk an ABI break. So, while I'd like to squeeze this into v13, we still need to think about what to do for v12. 0002 addresses the mentioned problem with ineq_histogram_selectivity by having that function actually verify that the query operator and collation match what the pg_statistic histogram was generated with. If they don't match, all is not lost. What we can do is just sequentially apply the query's operator and comparison constant to each histogram entry, and take the fraction of matches as our selectivity estimate. This is more or less the same insight we have used in generic_restriction_selectivity: the histogram is a pretty decent sample of the column, even if its ordering is not quite what you want. 0002 also deletes a hack I had put in get_attstatsslot() to insert a dummy value into sslot->stacoll. That hack isn't necessary any longer (because indeed we aren't using sslot->stacoll's value anywhere as of 0001), and it breaks the verification check that 0002 wants to add to ineq_histogram_selectivity, which depends on stacoll being truthful. I also adjusted get_variable_range() to deal with collations more honestly. When I went to test 0002, I found out that it broke some test cases in privileges.sql, and the reason was rather interesting. What those cases are relying on is getting a highly accurate selectivity estimate for a user-defined operator, for which the only thing the planner knows for sure is that it uses scalarltsel as the restriction estimator. Despite this lack of knowledge, the existing code just blithely uses the histogram as though it is *precisely* applicable to the user-defined operator. (Which it is, since that operator is just a wrapper around regular "<" ... but the system has no business assuming that.) So with the patch, the case exercises the new code path that just counts matches, and that gives us only 1/default_statistics_target resolution in the selectivity estimate; which is not enough to get the expected plan to be selected. I worked around this for the moment by cranking up default_statistics_target while running the ANALYZE in that test script, but I wonder if we should instead tweak those test cases to be more robust. I think the combination of 0001+0002 really moves the goalposts a long way in terms of having honest stats estimation for non-default collations, so I'd like to sneak it into v13. As for v12, about the only alternatives I can think of are: 1. Do nothing, reasoning that if nobody noticed for a year, this situation is enough of a corner case that we can leave it unfixed. Obviously that's pretty unsatisfying. 2. Change all the stats functions to pass DEFAULT_COLLATION_OID when invoking operator functions. This is not too attractive either because it essentially reverts 5e0928005; in fact, to avoid breaking things completely we'd likely have to revert the part of that commit that taught ANALYZE to collect stats using column collations instead of DEFAULT_COLLATION_OID. Then we get into questions like what about 6b0faf723 --- it's going to be a mess. 3. Hack things up so that the core code renames all these exposed functions to, say, ineq_histogram_selectivity_ext() and so on, allowing the additional arguments to exist, but the old names would still be there as ABI compatibility wrappers. This might produce slightly funny results for external code calling the wrappers, since the wrappers would have to assume DEFAULT_COLLATION_OID, but it'd avoid an ABI break at least. I don't want to propagate such a thing into HEAD, so this would leave us with unsightly differences between v12 and earlier/later branches -- but there aren't *that* many places involved. (I'd envision this approach as back-porting 0001 but not 0002. For one reason, there's noplace for a wrapper to get the additional operator OID needed for ineq_histogram_selectivity_ext. For another, the results for the privilege test suggest that 0002 might have surprising effects on user-defined operators, so back patching it might draw more complaints.) Alternatives #2 and #3 would result in (different) changes in the selectivity estimates v12 produces when considering columns with non-default collations and/or queries using collations that don't match the relevant columns. So that might be an argument for doing nothing in v12; people tend not to like it when minor releases cause unexpected plan changes. Also, #2 is probably strictly worse than #3 on this score, since it'd move such estimates away from reality not towards it. Thoughts? regards, tom lane diff --git a/contrib/ltree/ltree_op.c b/contrib/ltree/ltree_op.c index 4ac2ed5e54..778dbf1e98 100644 --- a/contrib/ltree/ltree_op.c +++ b/contrib/ltree/ltree_op.c @@ -582,7 +582,7 @@ ltreeparentsel(PG_FUNCTION_ARGS) double selec; /* Use generic restriction selectivity logic, with default 0.001. */ - selec = generic_restriction_selectivity(root, operator, + selec = generic_restriction_selectivity(root, operator, InvalidOid, args, varRelid, 0.001); diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c index 286e000d4e..ae5c8f084e 100644 --- a/src/backend/utils/adt/like_support.c +++ b/src/backend/utils/adt/like_support.c @@ -92,6 +92,7 @@ static Pattern_Prefix_Status pattern_fixed_prefix(Const *patt, static Selectivity prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, Oid eqopr, Oid ltopr, Oid geopr, + Oid collation, Const *prefixcon); static Selectivity like_selectivity(const char *patt, int pattlen, bool case_insensitive); @@ -534,12 +535,6 @@ patternsel_common(PlannerInfo *root, * something binary-compatible but different.) We can use it to identify * the comparison operators and the required type of the comparison * constant, much as in match_pattern_prefix(). - * - * NOTE: this logic does not consider collations. Ideally we'd force use - * of "C" collation, but since ANALYZE only generates statistics for the - * column's specified collation, we have little choice but to use those. - * But our results are so approximate anyway that it probably hardly - * matters. */ vartype = vardata.vartype; @@ -622,7 +617,7 @@ patternsel_common(PlannerInfo *root, /* * Pattern specifies an exact match, so estimate as for '=' */ - result = var_eq_const(&vardata, eqopr, prefix->constvalue, + result = var_eq_const(&vardata, eqopr, collation, prefix->constvalue, false, true, false); } else @@ -654,7 +649,8 @@ patternsel_common(PlannerInfo *root, opfuncid = get_opcode(oprid); fmgr_info(opfuncid, &opproc); - selec = histogram_selectivity(&vardata, &opproc, constval, true, + selec = histogram_selectivity(&vardata, &opproc, collation, + constval, true, 10, 1, &hist_size); /* If not at least 100 entries, use the heuristic method */ @@ -666,6 +662,7 @@ patternsel_common(PlannerInfo *root, if (pstatus == Pattern_Prefix_Partial) prefixsel = prefix_selectivity(root, &vardata, eqopr, ltopr, geopr, + collation, prefix); else prefixsel = 1.0; @@ -698,7 +695,8 @@ patternsel_common(PlannerInfo *root, * directly to the result selectivity. Also add up the total fraction * represented by MCV entries. */ - mcv_selec = mcv_selectivity(&vardata, &opproc, constval, true, + mcv_selec = mcv_selectivity(&vardata, &opproc, collation, + constval, true, &sumcommon); /* @@ -1196,7 +1194,7 @@ pattern_fixed_prefix(Const *patt, Pattern_Type ptype, Oid collation, * population represented by the histogram --- the caller must fold this * together with info about MCVs and NULLs. * - * We use the specified btree comparison operators to do the estimation. + * We use the given comparison operators and collation to do the estimation. * The given variable and Const must be of the associated datatype(s). * * XXX Note: we make use of the upper bound to estimate operator selectivity @@ -1207,11 +1205,11 @@ pattern_fixed_prefix(Const *patt, Pattern_Type ptype, Oid collation, static Selectivity prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, Oid eqopr, Oid ltopr, Oid geopr, + Oid collation, Const *prefixcon) { Selectivity prefixsel; FmgrInfo opproc; - AttStatsSlot sslot; Const *greaterstrcon; Selectivity eq_sel; @@ -1220,6 +1218,7 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, prefixsel = ineq_histogram_selectivity(root, vardata, &opproc, true, true, + collation, prefixcon->constvalue, prefixcon->consttype); @@ -1229,27 +1228,18 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, return DEFAULT_MATCH_SEL; } - /*------- - * If we can create a string larger than the prefix, say - * "x < greaterstr". We try to generate the string referencing the - * collation of the var's statistics, but if that's not available, - * use DEFAULT_COLLATION_OID. - *------- + /* + * If we can create a string larger than the prefix, say "x < greaterstr". */ - if (HeapTupleIsValid(vardata->statsTuple) && - get_attstatsslot(&sslot, vardata->statsTuple, - STATISTIC_KIND_HISTOGRAM, InvalidOid, 0)) - /* sslot.stacoll is set up */ ; - else - sslot.stacoll = DEFAULT_COLLATION_OID; fmgr_info(get_opcode(ltopr), &opproc); - greaterstrcon = make_greater_string(prefixcon, &opproc, sslot.stacoll); + greaterstrcon = make_greater_string(prefixcon, &opproc, collation); if (greaterstrcon) { Selectivity topsel; topsel = ineq_histogram_selectivity(root, vardata, &opproc, false, false, + collation, greaterstrcon->constvalue, greaterstrcon->consttype); @@ -1278,7 +1268,7 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, * probably off the end of the histogram, and thus we probably got a very * small estimate from the >= condition; so we still need to clamp. */ - eq_sel = var_eq_const(vardata, eqopr, prefixcon->constvalue, + eq_sel = var_eq_const(vardata, eqopr, collation, prefixcon->constvalue, false, true, false); prefixsel = Max(prefixsel, eq_sel); diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c index 863efd3d76..955e0ee87f 100644 --- a/src/backend/utils/adt/network_selfuncs.c +++ b/src/backend/utils/adt/network_selfuncs.c @@ -137,7 +137,8 @@ networksel(PG_FUNCTION_ARGS) * by MCV entries. */ fmgr_info(get_opcode(operator), &proc); - mcv_selec = mcv_selectivity(&vardata, &proc, constvalue, varonleft, + mcv_selec = mcv_selectivity(&vardata, &proc, InvalidOid, + constvalue, varonleft, &sumcommon); /* diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index cfb05682bc..2332277307 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -88,11 +88,7 @@ * (if any) is passed using the standard fmgr mechanism, so that the estimator * function can fetch it with PG_GET_COLLATION(). Note, however, that all * statistics in pg_statistic are currently built using the relevant column's - * collation. Thus, in most cases where we are looking at statistics, we - * should ignore the operator collation and use the stats entry's collation. - * We expect that the error induced by doing this is usually not large enough - * to justify complicating matters. In any case, doing otherwise would yield - * entirely garbage results for ordered stats data such as histograms. + * collation. *---------- */ @@ -149,14 +145,14 @@ get_relation_stats_hook_type get_relation_stats_hook = NULL; get_index_stats_hook_type get_index_stats_hook = NULL; static double eqsel_internal(PG_FUNCTION_ARGS, bool negate); -static double eqjoinsel_inner(Oid opfuncoid, +static double eqjoinsel_inner(Oid opfuncoid, Oid collation, VariableStatData *vardata1, VariableStatData *vardata2, double nd1, double nd2, bool isdefault1, bool isdefault2, AttStatsSlot *sslot1, AttStatsSlot *sslot2, Form_pg_statistic stats1, Form_pg_statistic stats2, bool have_mcvs1, bool have_mcvs2); -static double eqjoinsel_semi(Oid opfuncoid, +static double eqjoinsel_semi(Oid opfuncoid, Oid collation, VariableStatData *vardata1, VariableStatData *vardata2, double nd1, double nd2, bool isdefault1, bool isdefault2, @@ -194,10 +190,11 @@ static double convert_timevalue_to_scalar(Datum value, Oid typid, static void examine_simple_variable(PlannerInfo *root, Var *var, VariableStatData *vardata); static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata, - Oid sortop, Datum *min, Datum *max); + Oid sortop, Oid collation, + Datum *min, Datum *max); static bool get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata, - Oid sortop, + Oid sortop, Oid collation, Datum *min, Datum *max); static bool get_actual_variable_endpoint(Relation heapRel, Relation indexRel, @@ -235,6 +232,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) Oid operator = PG_GETARG_OID(1); List *args = (List *) PG_GETARG_POINTER(2); int varRelid = PG_GETARG_INT32(3); + Oid collation = PG_GET_COLLATION(); VariableStatData vardata; Node *other; bool varonleft; @@ -268,12 +266,12 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) * in the query.) */ if (IsA(other, Const)) - selec = var_eq_const(&vardata, operator, + selec = var_eq_const(&vardata, operator, collation, ((Const *) other)->constvalue, ((Const *) other)->constisnull, varonleft, negate); else - selec = var_eq_non_const(&vardata, operator, other, + selec = var_eq_non_const(&vardata, operator, collation, other, varonleft, negate); ReleaseVariableStats(vardata); @@ -287,7 +285,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) * This is exported so that some other estimation functions can use it. */ double -var_eq_const(VariableStatData *vardata, Oid operator, +var_eq_const(VariableStatData *vardata, Oid operator, Oid collation, Datum constval, bool constisnull, bool varonleft, bool negate) { @@ -356,7 +354,7 @@ var_eq_const(VariableStatData *vardata, Oid operator, * eqproc returns NULL, though really equality functions should * never do that. */ - InitFunctionCallInfoData(*fcinfo, &eqproc, 2, sslot.stacoll, + InitFunctionCallInfoData(*fcinfo, &eqproc, 2, collation, NULL, NULL); fcinfo->args[0].isnull = false; fcinfo->args[1].isnull = false; @@ -458,7 +456,7 @@ var_eq_const(VariableStatData *vardata, Oid operator, * This is exported so that some other estimation functions can use it. */ double -var_eq_non_const(VariableStatData *vardata, Oid operator, +var_eq_non_const(VariableStatData *vardata, Oid operator, Oid collation, Node *other, bool varonleft, bool negate) { @@ -573,6 +571,7 @@ neqsel(PG_FUNCTION_ARGS) */ static double scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq, + Oid collation, VariableStatData *vardata, Datum constval, Oid consttype) { Form_pg_statistic stats; @@ -672,7 +671,7 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq, * to the result selectivity. Also add up the total fraction represented * by MCV entries. */ - mcv_selec = mcv_selectivity(vardata, &opproc, constval, true, + mcv_selec = mcv_selectivity(vardata, &opproc, collation, constval, true, &sumcommon); /* @@ -681,6 +680,7 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq, */ hist_selec = ineq_histogram_selectivity(root, vardata, &opproc, isgt, iseq, + collation, constval, consttype); /* @@ -722,7 +722,7 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq, * if there is no MCV list. */ double -mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc, +mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc, Oid collation, Datum constval, bool varonleft, double *sumcommonp) { @@ -749,7 +749,7 @@ mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc, * operators that can return NULL. A small side benefit is to not * need to re-initialize the fcinfo struct from scratch each time. */ - InitFunctionCallInfoData(*fcinfo, opproc, 2, sslot.stacoll, + InitFunctionCallInfoData(*fcinfo, opproc, 2, collation, NULL, NULL); fcinfo->args[0].isnull = false; fcinfo->args[1].isnull = false; @@ -813,7 +813,8 @@ mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc, * prudent to clamp the result range, ie, disbelieve exact 0 or 1 outputs. */ double -histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc, +histogram_selectivity(VariableStatData *vardata, + FmgrInfo *opproc, Oid collation, Datum constval, bool varonleft, int min_hist_size, int n_skip, int *hist_size) @@ -846,7 +847,7 @@ histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc, * is to not need to re-initialize the fcinfo struct from scratch * each time. */ - InitFunctionCallInfoData(*fcinfo, opproc, 2, sslot.stacoll, + InitFunctionCallInfoData(*fcinfo, opproc, 2, collation, NULL, NULL); fcinfo->args[0].isnull = false; fcinfo->args[1].isnull = false; @@ -903,7 +904,7 @@ histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc, * Otherwise, fall back to the default selectivity provided by the caller. */ double -generic_restriction_selectivity(PlannerInfo *root, Oid oproid, +generic_restriction_selectivity(PlannerInfo *root, Oid oproid, Oid collation, List *args, int varRelid, double default_selectivity) { @@ -946,7 +947,8 @@ generic_restriction_selectivity(PlannerInfo *root, Oid oproid, /* * Calculate the selectivity for the column's most common values. */ - mcvsel = mcv_selectivity(&vardata, &opproc, constval, varonleft, + mcvsel = mcv_selectivity(&vardata, &opproc, collation, + constval, varonleft, &mcvsum); /* @@ -955,7 +957,7 @@ generic_restriction_selectivity(PlannerInfo *root, Oid oproid, * population. Otherwise use the default selectivity for the non-MCV * population. */ - selec = histogram_selectivity(&vardata, &opproc, + selec = histogram_selectivity(&vardata, &opproc, collation, constval, varonleft, 10, 1, &hist_size); if (selec < 0) @@ -1029,6 +1031,7 @@ double ineq_histogram_selectivity(PlannerInfo *root, VariableStatData *vardata, FmgrInfo *opproc, bool isgt, bool iseq, + Oid collation, Datum constval, Oid consttype) { double hist_selec; @@ -1042,9 +1045,11 @@ ineq_histogram_selectivity(PlannerInfo *root, * column type. However, to make that work we will need to figure out * which staop to search for --- it's not necessarily the one we have at * hand! (For example, we might have a '<=' operator rather than the '<' - * operator that will appear in staop.) For now, assume that whatever - * appears in pg_statistic is sorted the same way our operator sorts, or - * the reverse way if isgt is true. + * operator that will appear in staop.) The collation might not agree + * either. For now, just assume that whatever appears in pg_statistic is + * sorted the same way our operator sorts, or the reverse way if isgt is + * true. This could result in a bogus estimate, but it still seems better + * than falling back to the default estimate. */ if (HeapTupleIsValid(vardata->statsTuple) && statistic_proc_security_check(vardata, opproc->fn_oid) && @@ -1090,6 +1095,7 @@ ineq_histogram_selectivity(PlannerInfo *root, have_end = get_actual_variable_range(root, vardata, sslot.staop, + collation, &sslot.values[0], &sslot.values[1]); @@ -1107,17 +1113,19 @@ ineq_histogram_selectivity(PlannerInfo *root, have_end = get_actual_variable_range(root, vardata, sslot.staop, + collation, &sslot.values[0], NULL); else if (probe == sslot.nvalues - 1 && sslot.nvalues > 2) have_end = get_actual_variable_range(root, vardata, sslot.staop, + collation, NULL, &sslot.values[probe]); ltcmp = DatumGetBool(FunctionCall2Coll(opproc, - sslot.stacoll, + collation, sslot.values[probe], constval)); if (isgt) @@ -1202,7 +1210,7 @@ ineq_histogram_selectivity(PlannerInfo *root, * values to a uniform comparison scale, and do a linear * interpolation within this bin. */ - if (convert_to_scalar(constval, consttype, sslot.stacoll, + if (convert_to_scalar(constval, consttype, collation, &val, sslot.values[i - 1], sslot.values[i], vardata->vartype, @@ -1342,6 +1350,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq) Oid operator = PG_GETARG_OID(1); List *args = (List *) PG_GETARG_POINTER(2); int varRelid = PG_GETARG_INT32(3); + Oid collation = PG_GET_COLLATION(); VariableStatData vardata; Node *other; bool varonleft; @@ -1394,7 +1403,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq) } /* The rest of the work is done by scalarineqsel(). */ - selec = scalarineqsel(root, operator, isgt, iseq, + selec = scalarineqsel(root, operator, isgt, iseq, collation, &vardata, constval, consttype); ReleaseVariableStats(vardata); @@ -1459,7 +1468,7 @@ boolvarsel(PlannerInfo *root, Node *arg, int varRelid) * A boolean variable V is equivalent to the clause V = 't', so we * compute the selectivity as if that is what we have. */ - selec = var_eq_const(&vardata, BooleanEqualOperator, + selec = var_eq_const(&vardata, BooleanEqualOperator, InvalidOid, BoolGetDatum(true), false, true, false); } else @@ -2185,6 +2194,7 @@ eqjoinsel(PG_FUNCTION_ARGS) JoinType jointype = (JoinType) PG_GETARG_INT16(3); #endif SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4); + Oid collation = PG_GET_COLLATION(); double selec; double selec_inner; VariableStatData vardata1; @@ -2235,7 +2245,7 @@ eqjoinsel(PG_FUNCTION_ARGS) } /* We need to compute the inner-join selectivity in all cases */ - selec_inner = eqjoinsel_inner(opfuncoid, + selec_inner = eqjoinsel_inner(opfuncoid, collation, &vardata1, &vardata2, nd1, nd2, isdefault1, isdefault2, @@ -2262,7 +2272,7 @@ eqjoinsel(PG_FUNCTION_ARGS) inner_rel = find_join_input_rel(root, sjinfo->min_righthand); if (!join_is_reversed) - selec = eqjoinsel_semi(opfuncoid, + selec = eqjoinsel_semi(opfuncoid, collation, &vardata1, &vardata2, nd1, nd2, isdefault1, isdefault2, @@ -2275,7 +2285,7 @@ eqjoinsel(PG_FUNCTION_ARGS) Oid commop = get_commutator(operator); Oid commopfuncoid = OidIsValid(commop) ? get_opcode(commop) : InvalidOid; - selec = eqjoinsel_semi(commopfuncoid, + selec = eqjoinsel_semi(commopfuncoid, collation, &vardata2, &vardata1, nd2, nd1, isdefault2, isdefault1, @@ -2323,7 +2333,7 @@ eqjoinsel(PG_FUNCTION_ARGS) * that it's worth trying to distinguish them here. */ static double -eqjoinsel_inner(Oid opfuncoid, +eqjoinsel_inner(Oid opfuncoid, Oid collation, VariableStatData *vardata1, VariableStatData *vardata2, double nd1, double nd2, bool isdefault1, bool isdefault2, @@ -2373,7 +2383,7 @@ eqjoinsel_inner(Oid opfuncoid, * returns NULL, though really equality functions should never do * that. */ - InitFunctionCallInfoData(*fcinfo, &eqproc, 2, sslot1->stacoll, + InitFunctionCallInfoData(*fcinfo, &eqproc, 2, collation, NULL, NULL); fcinfo->args[0].isnull = false; fcinfo->args[1].isnull = false; @@ -2520,7 +2530,7 @@ eqjoinsel_inner(Oid opfuncoid, * Unlike eqjoinsel_inner, we have to cope with opfuncoid being InvalidOid. */ static double -eqjoinsel_semi(Oid opfuncoid, +eqjoinsel_semi(Oid opfuncoid, Oid collation, VariableStatData *vardata1, VariableStatData *vardata2, double nd1, double nd2, bool isdefault1, bool isdefault2, @@ -2603,7 +2613,7 @@ eqjoinsel_semi(Oid opfuncoid, * returns NULL, though really equality functions should never do * that. */ - InitFunctionCallInfoData(*fcinfo, &eqproc, 2, sslot1->stacoll, + InitFunctionCallInfoData(*fcinfo, &eqproc, 2, collation, NULL, NULL); fcinfo->args[0].isnull = false; fcinfo->args[1].isnull = false; @@ -2851,6 +2861,7 @@ mergejoinscansel(PlannerInfo *root, Node *clause, Oid op_lefttype; Oid op_righttype; Oid opno, + collation, lsortop, rsortop, lstatop, @@ -2875,6 +2886,7 @@ mergejoinscansel(PlannerInfo *root, Node *clause, if (!is_opclause(clause)) return; /* shouldn't happen */ opno = ((OpExpr *) clause)->opno; + collation = ((OpExpr *) clause)->inputcollid; left = get_leftop((Expr *) clause); right = get_rightop((Expr *) clause); if (!right) @@ -3008,20 +3020,20 @@ mergejoinscansel(PlannerInfo *root, Node *clause, /* Try to get ranges of both inputs */ if (!isgt) { - if (!get_variable_range(root, &leftvar, lstatop, + if (!get_variable_range(root, &leftvar, lstatop, collation, &leftmin, &leftmax)) goto fail; /* no range available from stats */ - if (!get_variable_range(root, &rightvar, rstatop, + if (!get_variable_range(root, &rightvar, rstatop, collation, &rightmin, &rightmax)) goto fail; /* no range available from stats */ } else { /* need to swap the max and min */ - if (!get_variable_range(root, &leftvar, lstatop, + if (!get_variable_range(root, &leftvar, lstatop, collation, &leftmax, &leftmin)) goto fail; /* no range available from stats */ - if (!get_variable_range(root, &rightvar, rstatop, + if (!get_variable_range(root, &rightvar, rstatop, collation, &rightmax, &rightmin)) goto fail; /* no range available from stats */ } @@ -3031,13 +3043,13 @@ mergejoinscansel(PlannerInfo *root, Node *clause, * fraction that's <= the right-side maximum value. But only believe * non-default estimates, else stick with our 1.0. */ - selec = scalarineqsel(root, leop, isgt, true, &leftvar, + selec = scalarineqsel(root, leop, isgt, true, collation, &leftvar, rightmax, op_righttype); if (selec != DEFAULT_INEQ_SEL) *leftend = selec; /* And similarly for the right variable. */ - selec = scalarineqsel(root, revleop, isgt, true, &rightvar, + selec = scalarineqsel(root, revleop, isgt, true, collation, &rightvar, leftmax, op_lefttype); if (selec != DEFAULT_INEQ_SEL) *rightend = selec; @@ -3061,13 +3073,13 @@ mergejoinscansel(PlannerInfo *root, Node *clause, * minimum value. But only believe non-default estimates, else stick with * our own default. */ - selec = scalarineqsel(root, ltop, isgt, false, &leftvar, + selec = scalarineqsel(root, ltop, isgt, false, collation, &leftvar, rightmin, op_righttype); if (selec != DEFAULT_INEQ_SEL) *leftstart = selec; /* And similarly for the right variable. */ - selec = scalarineqsel(root, revltop, isgt, false, &rightvar, + selec = scalarineqsel(root, revltop, isgt, false, collation, &rightvar, leftmin, op_lefttype); if (selec != DEFAULT_INEQ_SEL) *rightstart = selec; @@ -3147,10 +3159,11 @@ matchingsel(PG_FUNCTION_ARGS) Oid operator = PG_GETARG_OID(1); List *args = (List *) PG_GETARG_POINTER(2); int varRelid = PG_GETARG_INT32(3); + Oid collation = PG_GET_COLLATION(); double selec; /* Use generic restriction selectivity logic. */ - selec = generic_restriction_selectivity(root, operator, + selec = generic_restriction_selectivity(root, operator, collation, args, varRelid, DEFAULT_MATCHING_SEL); @@ -5337,9 +5350,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault) * * sortop is the "<" comparison operator to use. This should generally * be "<" not ">", as only the former is likely to be found in pg_statistic. + * The collation must be specified too. */ static bool -get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, +get_variable_range(PlannerInfo *root, VariableStatData *vardata, + Oid sortop, Oid collation, Datum *min, Datum *max) { Datum tmin = 0; @@ -5359,7 +5374,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, * before enabling this. */ #ifdef NOT_USED - if (get_actual_variable_range(root, vardata, sortop, min, max)) + if (get_actual_variable_range(root, vardata, sortop, collation, min, max)) return true; #endif @@ -5387,7 +5402,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, * * If there is a histogram that is sorted with some other operator than * the one we want, fail --- this suggests that there is data we can't - * use. + * use. XXX consider collation too. */ if (get_attstatsslot(&sslot, vardata->statsTuple, STATISTIC_KIND_HISTOGRAM, sortop, @@ -5434,14 +5449,14 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, continue; } if (DatumGetBool(FunctionCall2Coll(&opproc, - sslot.stacoll, + collation, sslot.values[i], tmin))) { tmin = sslot.values[i]; tmin_is_mcv = true; } if (DatumGetBool(FunctionCall2Coll(&opproc, - sslot.stacoll, + collation, tmax, sslot.values[i]))) { tmax = sslot.values[i]; @@ -5471,10 +5486,11 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, * If no data available, return false. * * sortop is the "<" comparison operator to use. + * collation is the required collation. */ static bool get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata, - Oid sortop, + Oid sortop, Oid collation, Datum *min, Datum *max) { bool have_data = false; @@ -5514,9 +5530,11 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata, continue; /* - * The first index column must match the desired variable and sort - * operator --- but we can use a descending-order index. + * The first index column must match the desired variable, sortop, and + * collation --- but we can use a descending-order index. */ + if (collation != index->indexcollations[0]) + continue; /* test first 'cause it's cheapest */ if (!match_index_to_operand(vardata->var, 0, index)) continue; switch (get_op_opfamily_strategy(sortop, index->sortopfamily[0])) diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h index 9690b4e486..15d2289024 100644 --- a/src/include/utils/selfuncs.h +++ b/src/include/utils/selfuncs.h @@ -144,24 +144,30 @@ extern void get_join_variables(PlannerInfo *root, List *args, bool *join_is_reversed); extern double get_variable_numdistinct(VariableStatData *vardata, bool *isdefault); -extern double mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc, +extern double mcv_selectivity(VariableStatData *vardata, + FmgrInfo *opproc, Oid collation, Datum constval, bool varonleft, double *sumcommonp); -extern double histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc, +extern double histogram_selectivity(VariableStatData *vardata, + FmgrInfo *opproc, Oid collation, Datum constval, bool varonleft, int min_hist_size, int n_skip, int *hist_size); -extern double generic_restriction_selectivity(PlannerInfo *root, Oid oproid, +extern double generic_restriction_selectivity(PlannerInfo *root, + Oid oproid, Oid collation, List *args, int varRelid, double default_selectivity); extern double ineq_histogram_selectivity(PlannerInfo *root, VariableStatData *vardata, FmgrInfo *opproc, bool isgt, bool iseq, + Oid collation, Datum constval, Oid consttype); -extern double var_eq_const(VariableStatData *vardata, Oid oproid, +extern double var_eq_const(VariableStatData *vardata, + Oid oproid, Oid collation, Datum constval, bool constisnull, bool varonleft, bool negate); -extern double var_eq_non_const(VariableStatData *vardata, Oid oproid, +extern double var_eq_non_const(VariableStatData *vardata, + Oid oproid, Oid collation, Node *other, bool varonleft, bool negate); diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c index ae5c8f084e..bcfbaa1c3d 100644 --- a/src/backend/utils/adt/like_support.c +++ b/src/backend/utils/adt/like_support.c @@ -1217,7 +1217,7 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, fmgr_info(get_opcode(geopr), &opproc); prefixsel = ineq_histogram_selectivity(root, vardata, - &opproc, true, true, + geopr, &opproc, true, true, collation, prefixcon->constvalue, prefixcon->consttype); @@ -1238,7 +1238,7 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, Selectivity topsel; topsel = ineq_histogram_selectivity(root, vardata, - &opproc, false, false, + ltopr, &opproc, false, false, collation, greaterstrcon->constvalue, greaterstrcon->consttype); diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 2332277307..208744cd3a 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -192,6 +192,10 @@ static void examine_simple_variable(PlannerInfo *root, Var *var, static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, Oid collation, Datum *min, Datum *max); +static void get_stats_slot_range(AttStatsSlot *sslot, + Oid opfuncoid, FmgrInfo *opproc, + Oid collation, int16 typLen, bool typByVal, + Datum *min, Datum *max, bool *p_have_data); static bool get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, Oid collation, @@ -679,7 +683,7 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq, * compute the resulting contribution to selectivity. */ hist_selec = ineq_histogram_selectivity(root, vardata, - &opproc, isgt, iseq, + operator, &opproc, isgt, iseq, collation, constval, consttype); @@ -1019,6 +1023,9 @@ generic_restriction_selectivity(PlannerInfo *root, Oid oproid, Oid collation, * satisfies the inequality condition, ie, VAR < (or <=, >, >=) CONST. * The isgt and iseq flags distinguish which of the four cases apply. * + * While opproc could be looked up from the operator OID, common callers + * also need to call it separately, so we make the caller pass both. + * * Returns -1 if there is no histogram (valid results will always be >= 0). * * Note that the result disregards both the most-common-values (if any) and @@ -1030,7 +1037,7 @@ generic_restriction_selectivity(PlannerInfo *root, Oid oproid, Oid collation, double ineq_histogram_selectivity(PlannerInfo *root, VariableStatData *vardata, - FmgrInfo *opproc, bool isgt, bool iseq, + Oid opoid, FmgrInfo *opproc, bool isgt, bool iseq, Oid collation, Datum constval, Oid consttype) { @@ -1057,7 +1064,9 @@ ineq_histogram_selectivity(PlannerInfo *root, STATISTIC_KIND_HISTOGRAM, InvalidOid, ATTSTATSSLOT_VALUES)) { - if (sslot.nvalues > 1) + if (sslot.nvalues > 1 && + sslot.stacoll == collation && + comparison_ops_are_compatible(sslot.staop, opoid)) { /* * Use binary search to find the desired location, namely the @@ -1332,6 +1341,49 @@ ineq_histogram_selectivity(PlannerInfo *root, hist_selec = 1.0 - cutoff; } } + else if (sslot.nvalues > 1) + { + /* + * If we get here, we have a histogram but it's not sorted the way + * we want. Do a brute-force search to see how many of the + * entries satisfy the comparison condition, and take that + * fraction as our estimate. (This is identical to the inner loop + * of histogram_selectivity; maybe share code?) + */ + LOCAL_FCINFO(fcinfo, 2); + int nmatch = 0; + + InitFunctionCallInfoData(*fcinfo, opproc, 2, collation, + NULL, NULL); + fcinfo->args[0].isnull = false; + fcinfo->args[1].isnull = false; + fcinfo->args[1].value = constval; + for (int i = 0; i < sslot.nvalues; i++) + { + Datum fresult; + + fcinfo->args[0].value = sslot.values[i]; + fcinfo->isnull = false; + fresult = FunctionCallInvoke(fcinfo); + if (!fcinfo->isnull && DatumGetBool(fresult)) + nmatch++; + } + hist_selec = ((double) nmatch) / ((double) sslot.nvalues); + + /* + * As above, clamp to a hundredth of the histogram resolution. + * This case is surely even less trustworthy than the normal one, + * so we shouldn't believe exact 0 or 1 selectivity. + */ + { + double cutoff = 0.01 / (double) (sslot.nvalues - 1); + + if (hist_selec < cutoff) + hist_selec = cutoff; + else if (hist_selec > 1.0 - cutoff) + hist_selec = 1.0 - cutoff; + } + } free_attstatsslot(&sslot); } @@ -5363,8 +5415,8 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, int16 typLen; bool typByVal; Oid opfuncoid; + FmgrInfo opproc; AttStatsSlot sslot; - int i; /* * XXX It's very tempting to try to use the actual column min and max, if @@ -5395,20 +5447,19 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, (opfuncoid = get_opcode(sortop)))) return false; + opproc.fn_oid = InvalidOid; /* mark this as not looked up yet */ + get_typlenbyval(vardata->atttype, &typLen, &typByVal); /* - * If there is a histogram, grab the first and last values. - * - * If there is a histogram that is sorted with some other operator than - * the one we want, fail --- this suggests that there is data we can't - * use. XXX consider collation too. + * If there is a histogram with the ordering we want, grab the first and + * last values. */ if (get_attstatsslot(&sslot, vardata->statsTuple, STATISTIC_KIND_HISTOGRAM, sortop, ATTSTATSSLOT_VALUES)) { - if (sslot.nvalues > 0) + if (sslot.stacoll == collation && sslot.nvalues > 0) { tmin = datumCopy(sslot.values[0], typByVal, typLen); tmax = datumCopy(sslot.values[sslot.nvalues - 1], typByVal, typLen); @@ -5416,57 +5467,36 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, } free_attstatsslot(&sslot); } - else if (get_attstatsslot(&sslot, vardata->statsTuple, - STATISTIC_KIND_HISTOGRAM, InvalidOid, - 0)) + + /* + * Otherwise, if there is a histogram with some other ordering, scan it + * and get the min and max values according to the ordering we want. This + * of course may not find values that are really extremal according to our + * ordering, but it beats ignoring available data. + */ + if (!have_data && + get_attstatsslot(&sslot, vardata->statsTuple, + STATISTIC_KIND_HISTOGRAM, InvalidOid, + ATTSTATSSLOT_VALUES)) { + get_stats_slot_range(&sslot, opfuncoid, &opproc, + collation, typLen, typByVal, + &tmin, &tmax, &have_data); free_attstatsslot(&sslot); - return false; } /* * If we have most-common-values info, look for extreme MCVs. This is * needed even if we also have a histogram, since the histogram excludes - * the MCVs. However, usually the MCVs will not be the extreme values, so - * avoid unnecessary data copying. + * the MCVs. */ if (get_attstatsslot(&sslot, vardata->statsTuple, STATISTIC_KIND_MCV, InvalidOid, ATTSTATSSLOT_VALUES)) { - bool tmin_is_mcv = false; - bool tmax_is_mcv = false; - FmgrInfo opproc; - - fmgr_info(opfuncoid, &opproc); - - for (i = 0; i < sslot.nvalues; i++) - { - if (!have_data) - { - tmin = tmax = sslot.values[i]; - tmin_is_mcv = tmax_is_mcv = have_data = true; - continue; - } - if (DatumGetBool(FunctionCall2Coll(&opproc, - collation, - sslot.values[i], tmin))) - { - tmin = sslot.values[i]; - tmin_is_mcv = true; - } - if (DatumGetBool(FunctionCall2Coll(&opproc, - collation, - tmax, sslot.values[i]))) - { - tmax = sslot.values[i]; - tmax_is_mcv = true; - } - } - if (tmin_is_mcv) - tmin = datumCopy(tmin, typByVal, typLen); - if (tmax_is_mcv) - tmax = datumCopy(tmax, typByVal, typLen); + get_stats_slot_range(&sslot, opfuncoid, &opproc, + collation, typLen, typByVal, + &tmin, &tmax, &have_data); free_attstatsslot(&sslot); } @@ -5475,6 +5505,61 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, return have_data; } +/* + * get_stats_slot_range: scan sslot for min/max values + * + * Subroutine for get_variable_range. + */ +static void +get_stats_slot_range(AttStatsSlot *sslot, Oid opfuncoid, FmgrInfo *opproc, + Oid collation, int16 typLen, bool typByVal, + Datum *min, Datum *max, bool *p_have_data) +{ + Datum tmin = *min; + Datum tmax = *max; + bool have_data = *p_have_data; + bool found_tmin = false; + bool found_tmax = false; + + /* Look up the comparison function, if we didn't already do so */ + if (opproc->fn_oid != opfuncoid) + fmgr_info(opfuncoid, opproc); + + /* Scan all the slot's values */ + for (int i = 0; i < sslot->nvalues; i++) + { + if (!have_data) + { + tmin = tmax = sslot->values[i]; + found_tmin = found_tmax = true; + *p_have_data = have_data = true; + continue; + } + if (DatumGetBool(FunctionCall2Coll(opproc, + collation, + sslot->values[i], tmin))) + { + tmin = sslot->values[i]; + found_tmin = true; + } + if (DatumGetBool(FunctionCall2Coll(opproc, + collation, + tmax, sslot->values[i]))) + { + tmax = sslot->values[i]; + found_tmax = true; + } + } + + /* + * Copy the slot's values, if we found new extreme values. + */ + if (found_tmin) + *min = datumCopy(tmin, typByVal, typLen); + if (found_tmax) + *max = datumCopy(tmax, typByVal, typLen); +} + /* * get_actual_variable_range diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c index 63d1263502..f3bf413829 100644 --- a/src/backend/utils/cache/lsyscache.c +++ b/src/backend/utils/cache/lsyscache.c @@ -731,6 +731,55 @@ equality_ops_are_compatible(Oid opno1, Oid opno2) return result; } +/* + * comparison_ops_are_compatible + * Return true if the two given comparison operators have compatible + * semantics. + * + * This is trivially true if they are the same operator. Otherwise, + * we look to see if they can be found in the same btree opfamily. + * For example, '<' and '>=' ops match if they belong to the same family. + * + * (This is identical to equality_ops_are_compatible(), except that we + * don't bother to examine hash opclasses.) + */ +bool +comparison_ops_are_compatible(Oid opno1, Oid opno2) +{ + bool result; + CatCList *catlist; + int i; + + /* Easy if they're the same operator */ + if (opno1 == opno2) + return true; + + /* + * We search through all the pg_amop entries for opno1. + */ + catlist = SearchSysCacheList1(AMOPOPID, ObjectIdGetDatum(opno1)); + + result = false; + for (i = 0; i < catlist->n_members; i++) + { + HeapTuple op_tuple = &catlist->members[i]->tuple; + Form_pg_amop op_form = (Form_pg_amop) GETSTRUCT(op_tuple); + + if (op_form->amopmethod == BTREE_AM_OID) + { + if (op_in_opfamily(opno2, op_form->amopfamily)) + { + result = true; + break; + } + } + } + + ReleaseSysCacheList(catlist); + + return result; +} + /* ---------- AMPROC CACHES ---------- */ @@ -3028,19 +3077,6 @@ get_attstatsslot(AttStatsSlot *sslot, HeapTuple statstuple, sslot->staop = (&stats->staop1)[i]; sslot->stacoll = (&stats->stacoll1)[i]; - /* - * XXX Hopefully-temporary hack: if stacoll isn't set, inject the default - * collation. This won't matter for non-collation-aware datatypes. For - * those that are, this covers cases where stacoll has not been set. In - * the short term we need this because some code paths involving type NAME - * do not pass any collation to prefix_selectivity and related functions. - * Even when that's been fixed, it's likely that some add-on typanalyze - * functions won't get the word right away about filling stacoll during - * ANALYZE, so we'll probably need this for awhile. - */ - if (sslot->stacoll == InvalidOid) - sslot->stacoll = DEFAULT_COLLATION_OID; - if (flags & ATTSTATSSLOT_VALUES) { val = SysCacheGetAttr(STATRELATTINH, statstuple, diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h index 91aed1f5a5..fecfe1f4f6 100644 --- a/src/include/utils/lsyscache.h +++ b/src/include/utils/lsyscache.h @@ -82,6 +82,7 @@ extern bool get_op_hash_functions(Oid opno, RegProcedure *lhs_procno, RegProcedure *rhs_procno); extern List *get_op_btree_interpretation(Oid opno); extern bool equality_ops_are_compatible(Oid opno1, Oid opno2); +extern bool comparison_ops_are_compatible(Oid opno1, Oid opno2); extern Oid get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype, int16 procnum); extern char *get_attname(Oid relid, AttrNumber attnum, bool missing_ok); diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h index 15d2289024..7ac4a06391 100644 --- a/src/include/utils/selfuncs.h +++ b/src/include/utils/selfuncs.h @@ -159,7 +159,8 @@ extern double generic_restriction_selectivity(PlannerInfo *root, double default_selectivity); extern double ineq_histogram_selectivity(PlannerInfo *root, VariableStatData *vardata, - FmgrInfo *opproc, bool isgt, bool iseq, + Oid opoid, FmgrInfo *opproc, + bool isgt, bool iseq, Oid collation, Datum constval, Oid consttype); extern double var_eq_const(VariableStatData *vardata, diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out index c2d037b614..7caf0c9b6b 100644 --- a/src/test/regress/expected/privileges.out +++ b/src/test/regress/expected/privileges.out @@ -191,7 +191,10 @@ CREATE TABLE atest12 as SELECT x AS a, 10001 - x AS b FROM generate_series(1,10000) x; CREATE INDEX ON atest12 (a); CREATE INDEX ON atest12 (abs(a)); +-- results below depend on having quite accurate stats for atest12 +SET default_statistics_target = 10000; VACUUM ANALYZE atest12; +RESET default_statistics_target; CREATE FUNCTION leak(integer,integer) RETURNS boolean AS $$begin return $1 < $2; end$$ LANGUAGE plpgsql immutable; diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql index 2ba69617dc..0ab5245b1e 100644 --- a/src/test/regress/sql/privileges.sql +++ b/src/test/regress/sql/privileges.sql @@ -136,7 +136,10 @@ CREATE TABLE atest12 as SELECT x AS a, 10001 - x AS b FROM generate_series(1,10000) x; CREATE INDEX ON atest12 (a); CREATE INDEX ON atest12 (abs(a)); +-- results below depend on having quite accurate stats for atest12 +SET default_statistics_target = 10000; VACUUM ANALYZE atest12; +RESET default_statistics_target; CREATE FUNCTION leak(integer,integer) RETURNS boolean AS $$begin return $1 < $2; end$$
Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation
From
Tom Lane
Date:
I wrote: > 3. Hack things up so that the core code renames all these exposed > functions to, say, ineq_histogram_selectivity_ext() and so on, > allowing the additional arguments to exist, but the old names would > still be there as ABI compatibility wrappers. Here's a proposed v12 patch along those lines. regards, tom lane diff --git a/contrib/ltree/ltree_op.c b/contrib/ltree/ltree_op.c index 51545e0ef4..ec8e40626f 100644 --- a/contrib/ltree/ltree_op.c +++ b/contrib/ltree/ltree_op.c @@ -578,6 +578,7 @@ ltreeparentsel(PG_FUNCTION_ARGS) Oid operator = PG_GETARG_OID(1); List *args = (List *) PG_GETARG_POINTER(2); int varRelid = PG_GETARG_INT32(3); + Oid collation = PG_GET_COLLATION(); VariableStatData vardata; Node *other; bool varonleft; @@ -617,8 +618,9 @@ ltreeparentsel(PG_FUNCTION_ARGS) /* * Is the constant "<@" to any of the column's most common values? */ - mcvsel = mcv_selectivity(&vardata, &contproc, constval, varonleft, - &mcvsum); + mcvsel = mcv_selectivity_ext(&vardata, &contproc, collation, + constval, varonleft, + &mcvsum); /* * If the histogram is large enough, see what fraction of it the @@ -626,9 +628,9 @@ ltreeparentsel(PG_FUNCTION_ARGS) * non-MCV population. Otherwise use the default selectivity for the * non-MCV population. */ - selec = histogram_selectivity(&vardata, &contproc, - constval, varonleft, - 10, 1, &hist_size); + selec = histogram_selectivity_ext(&vardata, &contproc, collation, + constval, varonleft, + 10, 1, &hist_size); if (selec < 0) { /* Nope, fall back on default */ diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c index 77cc378196..6465c9edcc 100644 --- a/src/backend/utils/adt/like_support.c +++ b/src/backend/utils/adt/like_support.c @@ -90,7 +90,9 @@ static Pattern_Prefix_Status pattern_fixed_prefix(Const *patt, Selectivity *rest_selec); static Selectivity prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, - Oid vartype, Oid opfamily, Const *prefixcon); + Oid vartype, Oid opfamily, + Oid collation, + Const *prefixcon); static Selectivity like_selectivity(const char *patt, int pattlen, bool case_insensitive); static Selectivity regex_selectivity(const char *patt, int pattlen, @@ -586,8 +588,8 @@ patternsel_common(PlannerInfo *root, if (eqopr == InvalidOid) elog(ERROR, "no = operator for opfamily %u", opfamily); - result = var_eq_const(&vardata, eqopr, prefix->constvalue, - false, true, false); + result = var_eq_const_ext(&vardata, eqopr, collation, + prefix->constvalue, false, true, false); } else { @@ -618,8 +620,9 @@ patternsel_common(PlannerInfo *root, opfuncid = get_opcode(oprid); fmgr_info(opfuncid, &opproc); - selec = histogram_selectivity(&vardata, &opproc, constval, true, - 10, 1, &hist_size); + selec = histogram_selectivity_ext(&vardata, &opproc, collation, + constval, true, + 10, 1, &hist_size); /* If not at least 100 entries, use the heuristic method */ if (hist_size < 100) @@ -629,7 +632,7 @@ patternsel_common(PlannerInfo *root, if (pstatus == Pattern_Prefix_Partial) prefixsel = prefix_selectivity(root, &vardata, vartype, - opfamily, prefix); + opfamily, collation, prefix); else prefixsel = 1.0; heursel = prefixsel * rest_selec; @@ -661,8 +664,9 @@ patternsel_common(PlannerInfo *root, * directly to the result selectivity. Also add up the total fraction * represented by MCV entries. */ - mcv_selec = mcv_selectivity(&vardata, &opproc, constval, true, - &sumcommon); + mcv_selec = mcv_selectivity_ext(&vardata, &opproc, collation, + constval, true, + &sumcommon); /* * Now merge the results from the MCV and histogram calculations, @@ -1170,12 +1174,13 @@ pattern_fixed_prefix(Const *patt, Pattern_Type ptype, Oid collation, */ static Selectivity prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, - Oid vartype, Oid opfamily, Const *prefixcon) + Oid vartype, Oid opfamily, + Oid collation, + Const *prefixcon) { Selectivity prefixsel; Oid cmpopr; FmgrInfo opproc; - AttStatsSlot sslot; Const *greaterstrcon; Selectivity eq_sel; @@ -1185,10 +1190,11 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, elog(ERROR, "no >= operator for opfamily %u", opfamily); fmgr_info(get_opcode(cmpopr), &opproc); - prefixsel = ineq_histogram_selectivity(root, vardata, - &opproc, true, true, - prefixcon->constvalue, - prefixcon->consttype); + prefixsel = ineq_histogram_selectivity_ext(root, vardata, + &opproc, true, true, + collation, + prefixcon->constvalue, + prefixcon->consttype); if (prefixsel < 0.0) { @@ -1196,33 +1202,24 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, return DEFAULT_MATCH_SEL; } - /*------- - * If we can create a string larger than the prefix, say - * "x < greaterstr". We try to generate the string referencing the - * collation of the var's statistics, but if that's not available, - * use DEFAULT_COLLATION_OID. - *------- + /* + * If we can create a string larger than the prefix, say "x < greaterstr". */ - if (HeapTupleIsValid(vardata->statsTuple) && - get_attstatsslot(&sslot, vardata->statsTuple, - STATISTIC_KIND_HISTOGRAM, InvalidOid, 0)) - /* sslot.stacoll is set up */ ; - else - sslot.stacoll = DEFAULT_COLLATION_OID; cmpopr = get_opfamily_member(opfamily, vartype, vartype, BTLessStrategyNumber); if (cmpopr == InvalidOid) elog(ERROR, "no < operator for opfamily %u", opfamily); fmgr_info(get_opcode(cmpopr), &opproc); - greaterstrcon = make_greater_string(prefixcon, &opproc, sslot.stacoll); + greaterstrcon = make_greater_string(prefixcon, &opproc, collation); if (greaterstrcon) { Selectivity topsel; - topsel = ineq_histogram_selectivity(root, vardata, - &opproc, false, false, - greaterstrcon->constvalue, - greaterstrcon->consttype); + topsel = ineq_histogram_selectivity_ext(root, vardata, + &opproc, false, false, + collation, + greaterstrcon->constvalue, + greaterstrcon->consttype); /* ineq_histogram_selectivity worked before, it shouldn't fail now */ Assert(topsel >= 0.0); @@ -1253,8 +1250,8 @@ prefix_selectivity(PlannerInfo *root, VariableStatData *vardata, BTEqualStrategyNumber); if (cmpopr == InvalidOid) elog(ERROR, "no = operator for opfamily %u", opfamily); - eq_sel = var_eq_const(vardata, cmpopr, prefixcon->constvalue, - false, true, false); + eq_sel = var_eq_const_ext(vardata, cmpopr, collation, prefixcon->constvalue, + false, true, false); prefixsel = Max(prefixsel, eq_sel); diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c index 5e0f0614ee..89131ebab7 100644 --- a/src/backend/utils/adt/network_selfuncs.c +++ b/src/backend/utils/adt/network_selfuncs.c @@ -137,8 +137,9 @@ networksel(PG_FUNCTION_ARGS) * by MCV entries. */ fmgr_info(get_opcode(operator), &proc); - mcv_selec = mcv_selectivity(&vardata, &proc, constvalue, varonleft, - &sumcommon); + mcv_selec = mcv_selectivity_ext(&vardata, &proc, InvalidOid, + constvalue, varonleft, + &sumcommon); /* * If we have a histogram, use it to estimate the proportion of the diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index b67897da88..f9a2c96b0e 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -88,11 +88,7 @@ * (if any) is passed using the standard fmgr mechanism, so that the estimator * function can fetch it with PG_GET_COLLATION(). Note, however, that all * statistics in pg_statistic are currently built using the relevant column's - * collation. Thus, in most cases where we are looking at statistics, we - * should ignore the operator collation and use the stats entry's collation. - * We expect that the error induced by doing this is usually not large enough - * to justify complicating matters. In any case, doing otherwise would yield - * entirely garbage results for ordered stats data such as histograms. + * collation. *---------- */ @@ -148,14 +144,14 @@ get_relation_stats_hook_type get_relation_stats_hook = NULL; get_index_stats_hook_type get_index_stats_hook = NULL; static double eqsel_internal(PG_FUNCTION_ARGS, bool negate); -static double eqjoinsel_inner(Oid opfuncoid, +static double eqjoinsel_inner(Oid opfuncoid, Oid collation, VariableStatData *vardata1, VariableStatData *vardata2, double nd1, double nd2, bool isdefault1, bool isdefault2, AttStatsSlot *sslot1, AttStatsSlot *sslot2, Form_pg_statistic stats1, Form_pg_statistic stats2, bool have_mcvs1, bool have_mcvs2); -static double eqjoinsel_semi(Oid opfuncoid, +static double eqjoinsel_semi(Oid opfuncoid, Oid collation, VariableStatData *vardata1, VariableStatData *vardata2, double nd1, double nd2, bool isdefault1, bool isdefault2, @@ -193,10 +189,11 @@ static double convert_timevalue_to_scalar(Datum value, Oid typid, static void examine_simple_variable(PlannerInfo *root, Var *var, VariableStatData *vardata); static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata, - Oid sortop, Datum *min, Datum *max); + Oid sortop, Oid collation, + Datum *min, Datum *max); static bool get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata, - Oid sortop, + Oid sortop, Oid collation, Datum *min, Datum *max); static bool get_actual_variable_endpoint(Relation heapRel, Relation indexRel, @@ -234,6 +231,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) Oid operator = PG_GETARG_OID(1); List *args = (List *) PG_GETARG_POINTER(2); int varRelid = PG_GETARG_INT32(3); + Oid collation = PG_GET_COLLATION(); VariableStatData vardata; Node *other; bool varonleft; @@ -267,10 +265,10 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) * in the query.) */ if (IsA(other, Const)) - selec = var_eq_const(&vardata, operator, - ((Const *) other)->constvalue, - ((Const *) other)->constisnull, - varonleft, negate); + selec = var_eq_const_ext(&vardata, operator, collation, + ((Const *) other)->constvalue, + ((Const *) other)->constisnull, + varonleft, negate); else selec = var_eq_non_const(&vardata, operator, other, varonleft, negate); @@ -289,6 +287,16 @@ double var_eq_const(VariableStatData *vardata, Oid operator, Datum constval, bool constisnull, bool varonleft, bool negate) +{ + return var_eq_const_ext(vardata, operator, DEFAULT_COLLATION_OID, + constval, constisnull, + varonleft, negate); +} + +double +var_eq_const_ext(VariableStatData *vardata, Oid operator, Oid collation, + Datum constval, bool constisnull, + bool varonleft, bool negate) { double selec; double nullfrac = 0.0; @@ -353,12 +361,12 @@ var_eq_const(VariableStatData *vardata, Oid operator, /* be careful to apply operator right way 'round */ if (varonleft) match = DatumGetBool(FunctionCall2Coll(&eqproc, - sslot.stacoll, + collation, sslot.values[i], constval)); else match = DatumGetBool(FunctionCall2Coll(&eqproc, - sslot.stacoll, + collation, constval, sslot.values[i])); if (match) @@ -555,6 +563,7 @@ neqsel(PG_FUNCTION_ARGS) */ static double scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq, + Oid collation, VariableStatData *vardata, Datum constval, Oid consttype) { Form_pg_statistic stats; @@ -654,16 +663,17 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq, * to the result selectivity. Also add up the total fraction represented * by MCV entries. */ - mcv_selec = mcv_selectivity(vardata, &opproc, constval, true, - &sumcommon); + mcv_selec = mcv_selectivity_ext(vardata, &opproc, collation, constval, true, + &sumcommon); /* * If there is a histogram, determine which bin the constant falls in, and * compute the resulting contribution to selectivity. */ - hist_selec = ineq_histogram_selectivity(root, vardata, - &opproc, isgt, iseq, - constval, consttype); + hist_selec = ineq_histogram_selectivity_ext(root, vardata, + &opproc, isgt, iseq, + collation, + constval, consttype); /* * Now merge the results from the MCV and histogram calculations, @@ -707,6 +717,15 @@ double mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc, Datum constval, bool varonleft, double *sumcommonp) +{ + return mcv_selectivity_ext(vardata, opproc, DEFAULT_COLLATION_OID, + constval, varonleft, sumcommonp); +} + +double +mcv_selectivity_ext(VariableStatData *vardata, FmgrInfo *opproc, Oid collation, + Datum constval, bool varonleft, + double *sumcommonp) { double mcv_selec, sumcommon; @@ -726,11 +745,11 @@ mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc, { if (varonleft ? DatumGetBool(FunctionCall2Coll(opproc, - sslot.stacoll, + collation, sslot.values[i], constval)) : DatumGetBool(FunctionCall2Coll(opproc, - sslot.stacoll, + collation, constval, sslot.values[i]))) mcv_selec += sslot.numbers[i]; @@ -780,6 +799,20 @@ histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc, Datum constval, bool varonleft, int min_hist_size, int n_skip, int *hist_size) +{ + return histogram_selectivity_ext(vardata, + opproc, DEFAULT_COLLATION_OID, + constval, varonleft, + min_hist_size, n_skip, + hist_size); +} + +double +histogram_selectivity_ext(VariableStatData *vardata, + FmgrInfo *opproc, Oid collation, + Datum constval, bool varonleft, + int min_hist_size, int n_skip, + int *hist_size) { double result; AttStatsSlot sslot; @@ -804,11 +837,11 @@ histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc, { if (varonleft ? DatumGetBool(FunctionCall2Coll(opproc, - sslot.stacoll, + collation, sslot.values[i], constval)) : DatumGetBool(FunctionCall2Coll(opproc, - sslot.stacoll, + collation, constval, sslot.values[i]))) nmatch++; @@ -848,6 +881,19 @@ ineq_histogram_selectivity(PlannerInfo *root, VariableStatData *vardata, FmgrInfo *opproc, bool isgt, bool iseq, Datum constval, Oid consttype) +{ + return ineq_histogram_selectivity_ext(root, vardata, + opproc, isgt, iseq, + DEFAULT_COLLATION_OID, + constval, consttype); +} + +double +ineq_histogram_selectivity_ext(PlannerInfo *root, + VariableStatData *vardata, + FmgrInfo *opproc, bool isgt, bool iseq, + Oid collation, + Datum constval, Oid consttype) { double hist_selec; AttStatsSlot sslot; @@ -860,9 +906,11 @@ ineq_histogram_selectivity(PlannerInfo *root, * column type. However, to make that work we will need to figure out * which staop to search for --- it's not necessarily the one we have at * hand! (For example, we might have a '<=' operator rather than the '<' - * operator that will appear in staop.) For now, assume that whatever - * appears in pg_statistic is sorted the same way our operator sorts, or - * the reverse way if isgt is true. + * operator that will appear in staop.) The collation might not agree + * either. For now, just assume that whatever appears in pg_statistic is + * sorted the same way our operator sorts, or the reverse way if isgt is + * true. This could result in a bogus estimate, but it still seems better + * than falling back to the default estimate. */ if (HeapTupleIsValid(vardata->statsTuple) && statistic_proc_security_check(vardata, opproc->fn_oid) && @@ -908,6 +956,7 @@ ineq_histogram_selectivity(PlannerInfo *root, have_end = get_actual_variable_range(root, vardata, sslot.staop, + collation, &sslot.values[0], &sslot.values[1]); @@ -925,17 +974,19 @@ ineq_histogram_selectivity(PlannerInfo *root, have_end = get_actual_variable_range(root, vardata, sslot.staop, + collation, &sslot.values[0], NULL); else if (probe == sslot.nvalues - 1 && sslot.nvalues > 2) have_end = get_actual_variable_range(root, vardata, sslot.staop, + collation, NULL, &sslot.values[probe]); ltcmp = DatumGetBool(FunctionCall2Coll(opproc, - sslot.stacoll, + collation, sslot.values[probe], constval)); if (isgt) @@ -1020,7 +1071,7 @@ ineq_histogram_selectivity(PlannerInfo *root, * values to a uniform comparison scale, and do a linear * interpolation within this bin. */ - if (convert_to_scalar(constval, consttype, sslot.stacoll, + if (convert_to_scalar(constval, consttype, collation, &val, sslot.values[i - 1], sslot.values[i], vardata->vartype, @@ -1160,6 +1211,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq) Oid operator = PG_GETARG_OID(1); List *args = (List *) PG_GETARG_POINTER(2); int varRelid = PG_GETARG_INT32(3); + Oid collation = PG_GET_COLLATION(); VariableStatData vardata; Node *other; bool varonleft; @@ -1212,7 +1264,7 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq) } /* The rest of the work is done by scalarineqsel(). */ - selec = scalarineqsel(root, operator, isgt, iseq, + selec = scalarineqsel(root, operator, isgt, iseq, collation, &vardata, constval, consttype); ReleaseVariableStats(vardata); @@ -1277,8 +1329,8 @@ boolvarsel(PlannerInfo *root, Node *arg, int varRelid) * A boolean variable V is equivalent to the clause V = 't', so we * compute the selectivity as if that is what we have. */ - selec = var_eq_const(&vardata, BooleanEqualOperator, - BoolGetDatum(true), false, true, false); + selec = var_eq_const_ext(&vardata, BooleanEqualOperator, InvalidOid, + BoolGetDatum(true), false, true, false); } else { @@ -2003,6 +2055,7 @@ eqjoinsel(PG_FUNCTION_ARGS) JoinType jointype = (JoinType) PG_GETARG_INT16(3); #endif SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4); + Oid collation = PG_GET_COLLATION(); double selec; double selec_inner; VariableStatData vardata1; @@ -2053,7 +2106,7 @@ eqjoinsel(PG_FUNCTION_ARGS) } /* We need to compute the inner-join selectivity in all cases */ - selec_inner = eqjoinsel_inner(opfuncoid, + selec_inner = eqjoinsel_inner(opfuncoid, collation, &vardata1, &vardata2, nd1, nd2, isdefault1, isdefault2, @@ -2080,7 +2133,7 @@ eqjoinsel(PG_FUNCTION_ARGS) inner_rel = find_join_input_rel(root, sjinfo->min_righthand); if (!join_is_reversed) - selec = eqjoinsel_semi(opfuncoid, + selec = eqjoinsel_semi(opfuncoid, collation, &vardata1, &vardata2, nd1, nd2, isdefault1, isdefault2, @@ -2093,7 +2146,7 @@ eqjoinsel(PG_FUNCTION_ARGS) Oid commop = get_commutator(operator); Oid commopfuncoid = OidIsValid(commop) ? get_opcode(commop) : InvalidOid; - selec = eqjoinsel_semi(commopfuncoid, + selec = eqjoinsel_semi(commopfuncoid, collation, &vardata2, &vardata1, nd2, nd1, isdefault2, isdefault1, @@ -2141,7 +2194,7 @@ eqjoinsel(PG_FUNCTION_ARGS) * that it's worth trying to distinguish them here. */ static double -eqjoinsel_inner(Oid opfuncoid, +eqjoinsel_inner(Oid opfuncoid, Oid collation, VariableStatData *vardata1, VariableStatData *vardata2, double nd1, double nd2, bool isdefault1, bool isdefault2, @@ -2203,7 +2256,7 @@ eqjoinsel_inner(Oid opfuncoid, if (hasmatch2[j]) continue; if (DatumGetBool(FunctionCall2Coll(&eqproc, - sslot1->stacoll, + collation, sslot1->values[i], sslot2->values[j]))) { @@ -2321,7 +2374,7 @@ eqjoinsel_inner(Oid opfuncoid, * Unlike eqjoinsel_inner, we have to cope with opfuncoid being InvalidOid. */ static double -eqjoinsel_semi(Oid opfuncoid, +eqjoinsel_semi(Oid opfuncoid, Oid collation, VariableStatData *vardata1, VariableStatData *vardata2, double nd1, double nd2, bool isdefault1, bool isdefault2, @@ -2415,7 +2468,7 @@ eqjoinsel_semi(Oid opfuncoid, if (hasmatch2[j]) continue; if (DatumGetBool(FunctionCall2Coll(&eqproc, - sslot1->stacoll, + collation, sslot1->values[i], sslot2->values[j]))) { @@ -2635,6 +2688,7 @@ mergejoinscansel(PlannerInfo *root, Node *clause, Oid op_lefttype; Oid op_righttype; Oid opno, + collation, lsortop, rsortop, lstatop, @@ -2659,6 +2713,7 @@ mergejoinscansel(PlannerInfo *root, Node *clause, if (!is_opclause(clause)) return; /* shouldn't happen */ opno = ((OpExpr *) clause)->opno; + collation = ((OpExpr *) clause)->inputcollid; left = get_leftop((Expr *) clause); right = get_rightop((Expr *) clause); if (!right) @@ -2792,20 +2847,20 @@ mergejoinscansel(PlannerInfo *root, Node *clause, /* Try to get ranges of both inputs */ if (!isgt) { - if (!get_variable_range(root, &leftvar, lstatop, + if (!get_variable_range(root, &leftvar, lstatop, collation, &leftmin, &leftmax)) goto fail; /* no range available from stats */ - if (!get_variable_range(root, &rightvar, rstatop, + if (!get_variable_range(root, &rightvar, rstatop, collation, &rightmin, &rightmax)) goto fail; /* no range available from stats */ } else { /* need to swap the max and min */ - if (!get_variable_range(root, &leftvar, lstatop, + if (!get_variable_range(root, &leftvar, lstatop, collation, &leftmax, &leftmin)) goto fail; /* no range available from stats */ - if (!get_variable_range(root, &rightvar, rstatop, + if (!get_variable_range(root, &rightvar, rstatop, collation, &rightmax, &rightmin)) goto fail; /* no range available from stats */ } @@ -2815,13 +2870,13 @@ mergejoinscansel(PlannerInfo *root, Node *clause, * fraction that's <= the right-side maximum value. But only believe * non-default estimates, else stick with our 1.0. */ - selec = scalarineqsel(root, leop, isgt, true, &leftvar, + selec = scalarineqsel(root, leop, isgt, true, collation, &leftvar, rightmax, op_righttype); if (selec != DEFAULT_INEQ_SEL) *leftend = selec; /* And similarly for the right variable. */ - selec = scalarineqsel(root, revleop, isgt, true, &rightvar, + selec = scalarineqsel(root, revleop, isgt, true, collation, &rightvar, leftmax, op_lefttype); if (selec != DEFAULT_INEQ_SEL) *rightend = selec; @@ -2845,13 +2900,13 @@ mergejoinscansel(PlannerInfo *root, Node *clause, * minimum value. But only believe non-default estimates, else stick with * our own default. */ - selec = scalarineqsel(root, ltop, isgt, false, &leftvar, + selec = scalarineqsel(root, ltop, isgt, false, collation, &leftvar, rightmin, op_righttype); if (selec != DEFAULT_INEQ_SEL) *leftstart = selec; /* And similarly for the right variable. */ - selec = scalarineqsel(root, revltop, isgt, false, &rightvar, + selec = scalarineqsel(root, revltop, isgt, false, collation, &rightvar, leftmin, op_lefttype); if (selec != DEFAULT_INEQ_SEL) *rightstart = selec; @@ -5124,9 +5179,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault) * * sortop is the "<" comparison operator to use. This should generally * be "<" not ">", as only the former is likely to be found in pg_statistic. + * The collation must be specified too. */ static bool -get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, +get_variable_range(PlannerInfo *root, VariableStatData *vardata, + Oid sortop, Oid collation, Datum *min, Datum *max) { Datum tmin = 0; @@ -5146,7 +5203,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, * before enabling this. */ #ifdef NOT_USED - if (get_actual_variable_range(root, vardata, sortop, min, max)) + if (get_actual_variable_range(root, vardata, sortop, collation, min, max)) return true; #endif @@ -5174,7 +5231,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, * * If there is a histogram that is sorted with some other operator than * the one we want, fail --- this suggests that there is data we can't - * use. + * use. XXX consider collation too. */ if (get_attstatsslot(&sslot, vardata->statsTuple, STATISTIC_KIND_HISTOGRAM, sortop, @@ -5221,14 +5278,14 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, continue; } if (DatumGetBool(FunctionCall2Coll(&opproc, - sslot.stacoll, + collation, sslot.values[i], tmin))) { tmin = sslot.values[i]; tmin_is_mcv = true; } if (DatumGetBool(FunctionCall2Coll(&opproc, - sslot.stacoll, + collation, tmax, sslot.values[i]))) { tmax = sslot.values[i]; @@ -5258,10 +5315,11 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata, Oid sortop, * If no data available, return false. * * sortop is the "<" comparison operator to use. + * collation is the required collation. */ static bool get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata, - Oid sortop, + Oid sortop, Oid collation, Datum *min, Datum *max) { bool have_data = false; @@ -5301,9 +5359,11 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata, continue; /* - * The first index column must match the desired variable and sort - * operator --- but we can use a descending-order index. + * The first index column must match the desired variable, sortop, and + * collation --- but we can use a descending-order index. */ + if (collation != index->indexcollations[0]) + continue; /* test first 'cause it's cheapest */ if (!match_index_to_operand(vardata->var, 0, index)) continue; switch (get_op_opfamily_strategy(sortop, index->sortopfamily[0])) diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h index 85d9ecbfc6..521cd84130 100644 --- a/src/include/utils/selfuncs.h +++ b/src/include/utils/selfuncs.h @@ -143,17 +143,36 @@ extern double get_variable_numdistinct(VariableStatData *vardata, extern double mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc, Datum constval, bool varonleft, double *sumcommonp); +extern double mcv_selectivity_ext(VariableStatData *vardata, + FmgrInfo *opproc, Oid collation, + Datum constval, bool varonleft, + double *sumcommonp); extern double histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc, Datum constval, bool varonleft, int min_hist_size, int n_skip, int *hist_size); +extern double histogram_selectivity_ext(VariableStatData *vardata, + FmgrInfo *opproc, Oid collation, + Datum constval, bool varonleft, + int min_hist_size, int n_skip, + int *hist_size); extern double ineq_histogram_selectivity(PlannerInfo *root, VariableStatData *vardata, FmgrInfo *opproc, bool isgt, bool iseq, Datum constval, Oid consttype); +extern double ineq_histogram_selectivity_ext(PlannerInfo *root, + VariableStatData *vardata, + FmgrInfo *opproc, + bool isgt, bool iseq, + Oid collation, + Datum constval, Oid consttype); extern double var_eq_const(VariableStatData *vardata, Oid oproid, Datum constval, bool constisnull, bool varonleft, bool negate); +extern double var_eq_const_ext(VariableStatData *vardata, + Oid oproid, Oid collation, + Datum constval, bool constisnull, + bool varonleft, bool negate); extern double var_eq_non_const(VariableStatData *vardata, Oid oproid, Node *other, bool varonleft, bool negate);