Thread: Nondeterministic collations vs. text_pattern_ops
Whilst poking at the leakproofness-of-texteq issue, I realized that there's an independent problem caused by the nondeterminism patch. To wit, that the text_pattern_ops btree opclass uses texteq as its equality operator, even though that operator is no longer guaranteed to be bitwise equality. That means that depending on which collation happens to get attached to the operator, equality might be inconsistent with the other members of the opclass, leading to who-knows-what bad results. bpchar_pattern_ops has the same issue with respect to bpchareq. The obvious fix for this is to invent separate new equality operators, but that's actually rather disastrous for performance, because text_pattern_ops indexes would no longer be able to use WHERE clauses using plain equality. That also feeds into whether equality clauses deduced from equivalence classes will work for them (nope, not any more). People using such indexes are just about certain to be bitterly unhappy. We may not have any choice but to do that, though --- I sure don't see any other easy fix. If we could be certain that the collation attached to the operator is deterministic, then it would still work with a pattern_ops index, but that's not a concept that the index infrastructure has got right now. Whatever we do about this is likely to require a catversion bump, meaning we've got to fix it *now*. regards, tom lane
On 2019-09-17 01:13, Tom Lane wrote: > Whilst poking at the leakproofness-of-texteq issue, I realized > that there's an independent problem caused by the nondeterminism > patch. To wit, that the text_pattern_ops btree opclass uses > texteq as its equality operator, even though that operator is > no longer guaranteed to be bitwise equality. That means that > depending on which collation happens to get attached to the > operator, equality might be inconsistent with the other members > of the opclass, leading to who-knows-what bad results. You can't create a text_pattern_ops index on a column with nondeterministic collation: create collation c1 (provider = icu, locale = 'und', deterministic = false); create table t1 (a int, b text collate c1); create index on t1 (b text_pattern_ops); ERROR: nondeterministic collations are not supported for operator class "text_pattern_ops" There is some discussion in internal_text_pattern_compare(). Are there other cases we need to consider? I notice that there is a hash opclass text_pattern_ops, which I'd actually never heard of until now, and I don't see documented. What would we need to do about that? > The obvious fix for this is to invent separate new equality operators, > but that's actually rather disastrous for performance, because > text_pattern_ops indexes would no longer be able to use WHERE clauses > using plain equality. That also feeds into whether equality clauses > deduced from equivalence classes will work for them (nope, not any > more). People using such indexes are just about certain to be > bitterly unhappy. Would it help if one created COLLATE "C" indexes instead of text_pattern_ops? What are the tradeoffs between the two approaches? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 2019-09-17 01:13, Tom Lane wrote: >> Whilst poking at the leakproofness-of-texteq issue, I realized >> that there's an independent problem caused by the nondeterminism >> patch. To wit, that the text_pattern_ops btree opclass uses >> texteq as its equality operator, even though that operator is >> no longer guaranteed to be bitwise equality. That means that >> depending on which collation happens to get attached to the >> operator, equality might be inconsistent with the other members >> of the opclass, leading to who-knows-what bad results. > You can't create a text_pattern_ops index on a column with > nondeterministic collation: > create collation c1 (provider = icu, locale = 'und', deterministic = false); > create table t1 (a int, b text collate c1); > create index on t1 (b text_pattern_ops); > ERROR: nondeterministic collations are not supported for operator class > "text_pattern_ops" Oh! I'd seen that error message, but not realized that it'd get thrown during index creation. > There is some discussion in internal_text_pattern_compare(). I don't much like doing it that way: looking up the determinism property of the collation over again in every single comparison seems pretty expensive, plus the operator is way exceeding its actual knowledge of the call context by throwing an error phrased that way. > Are there other cases we need to consider? I think that disallowing indexes with this combination of opclass and collation may actually be sufficient. A query can request equality using any collation, but it won't get matched to an index with a different collation, so I think we're safe against index-related problems if we have that restriction. AFAIR, the only other place in the system where non-default opclasses can be invoked is ORDER BY. Somebody could write ORDER BY textcol COLLATE "nondeterm" USING ~<~ However, I think we're actually okay on that one, because although the equality opclass member is out of sync with the rest, it won't get consulted during a sort. internal_text_pattern_compare will throw an error for this, but I don't believe it actually needs to. My recommendation is to get rid of the run-time checks and instead put a hack like this into DefineIndex or some minion thereof: if ((opclass == TEXT_PATTERN_BTREE_CLASS_OID || opclass == VARCHAR_PATTERN_BTREE_CLASS_OID || opclass == BPCHAR_PATTERN_BTREE_CLASS_OID) && !get_collation_isdeterministic(collid)) ereport(ERROR, ...); Hard-wiring that is ugly; maybe someday we'd wish to expose "doesn't allow nondeterminism" as a more generally-available opclass property. But without some other examples that have a need for it, I think it's not worth the work to create infrastructure for that. It's not like there are no other hard-wired legacy behaviors in DefineIndex... > I notice that there is a hash opclass text_pattern_ops, which I'd > actually never heard of until now, and I don't see documented. What > would we need to do about that? Huh. I wonder why that's there --- I can't see a reason why it'd behave differently from the regular hash text_ops. Maybe we feared that someday it would need to be different? Anyway, I think we can just ignore it for this purpose. > Would it help if one created COLLATE "C" indexes instead of > text_pattern_ops? What are the tradeoffs between the two approaches? Of course, the pattern_ops opclasses long predate our COLLATE support. I suspect if we'd had COLLATE we never would have invented them. I believe the pattern_ops are equivalent to a COLLATE "C" index, both theoretically and in terms of the optimizations we know about making. There might be some minor performance difference from not having to look up the collation, but not much if we've done the collate-is-c fast paths properly. regards, tom lane
On 2019-09-17 17:17, Tom Lane wrote: > My recommendation is to get rid of the run-time checks and instead > put a hack like this into DefineIndex or some minion thereof: > > if ((opclass == TEXT_PATTERN_BTREE_CLASS_OID || > opclass == VARCHAR_PATTERN_BTREE_CLASS_OID || > opclass == BPCHAR_PATTERN_BTREE_CLASS_OID) && > !get_collation_isdeterministic(collid)) > ereport(ERROR, ...); Here is a draft patch. It will require a catversion change because those operator classes don't have assigned OIDs so far. The comment block I just moved over for the time being. It should probably be rephrased a bit. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > Here is a draft patch. > It will require a catversion change because those operator classes don't > have assigned OIDs so far. That's slightly annoying given where we are with v12. We could avoid it by looking up the opclass's opfamily and seeing if it's TEXT_BTREE_FAM_OID etc, which do already have hand-assigned OIDs. But maybe avoiding a catversion bump now is not worth the cost of an extra syscache lookup. (It'd give me an excuse to shove the leakproofness-marking changes from the other thread into v12, so there's that.) Speaking of extra syscache lookups, I don't like that you rearranged the if-test to check nondeterminism before the opclass identity checks. That's usually going to be a wasted lookup. > The comment block I just moved over for the time being. It should > probably be rephrased a bit. Indeed. Maybe like * text_pattern_ops uses text_eq as the equality operator, which is * fine as long as the collation is deterministic; text_eq then * reduces to bitwise equality and so it is semantically compatible * with the other operators and functions in the opclass. But with a * nondeterministic collation, text_eq could yield results that are * incompatible with the actual behavior of the index (which is * determined by the opclass's comparison function). We prevent * such problems by refusing creation of an index with this opclass * and a nondeterministic collation. * * The same applies to varchar_pattern_ops and bpchar_pattern_ops. * If we find more cases, we might decide to create a real mechanism * for marking opclasses as incompatible with nondeterminism; but * for now, this small hack suffices. * * Another solution is to use a special operator, not text_eq, as the * equality opclass member, but that is undesirable because it would * prevent index usage in many queries that work fine today. regards, tom lane
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: >> Here is a draft patch. Where are we on pushing that? I'm starting to get antsy about the amount of time remaining before rc1. It's a low-risk fix, but still, it'd be best to have a complete buildfarm cycle on it before Monday's wrap. regards, tom lane
I wrote: > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: >>> Here is a draft patch. > Where are we on pushing that? I'm starting to get antsy about the > amount of time remaining before rc1. It's a low-risk fix, but still, > it'd be best to have a complete buildfarm cycle on it before Monday's > wrap. Since time is now really running short, I went ahead and pushed this, after doing a closer review and finding one nitpicky bug. regards, tom lane
On 2019-09-21 22:30, Tom Lane wrote: > I wrote: >> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: >>>> Here is a draft patch. > >> Where are we on pushing that? I'm starting to get antsy about the >> amount of time remaining before rc1. It's a low-risk fix, but still, >> it'd be best to have a complete buildfarm cycle on it before Monday's >> wrap. > > Since time is now really running short, I went ahead and pushed > this, after doing a closer review and finding one nitpicky bug. Great! I think that covers all the open issues then. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services