Re: Nondeterministic collations vs. text_pattern_ops - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Nondeterministic collations vs. text_pattern_ops |
Date | |
Msg-id | 3615.1568733452@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Nondeterministic collations vs. text_pattern_ops (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>) |
Responses |
Re: Nondeterministic collations vs. text_pattern_ops
|
List | pgsql-hackers |
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 2019-09-17 01:13, Tom Lane wrote: >> Whilst poking at the leakproofness-of-texteq issue, I realized >> that there's an independent problem caused by the nondeterminism >> patch. To wit, that the text_pattern_ops btree opclass uses >> texteq as its equality operator, even though that operator is >> no longer guaranteed to be bitwise equality. That means that >> depending on which collation happens to get attached to the >> operator, equality might be inconsistent with the other members >> of the opclass, leading to who-knows-what bad results. > You can't create a text_pattern_ops index on a column with > nondeterministic collation: > create collation c1 (provider = icu, locale = 'und', deterministic = false); > create table t1 (a int, b text collate c1); > create index on t1 (b text_pattern_ops); > ERROR: nondeterministic collations are not supported for operator class > "text_pattern_ops" Oh! I'd seen that error message, but not realized that it'd get thrown during index creation. > There is some discussion in internal_text_pattern_compare(). I don't much like doing it that way: looking up the determinism property of the collation over again in every single comparison seems pretty expensive, plus the operator is way exceeding its actual knowledge of the call context by throwing an error phrased that way. > Are there other cases we need to consider? I think that disallowing indexes with this combination of opclass and collation may actually be sufficient. A query can request equality using any collation, but it won't get matched to an index with a different collation, so I think we're safe against index-related problems if we have that restriction. AFAIR, the only other place in the system where non-default opclasses can be invoked is ORDER BY. Somebody could write ORDER BY textcol COLLATE "nondeterm" USING ~<~ However, I think we're actually okay on that one, because although the equality opclass member is out of sync with the rest, it won't get consulted during a sort. internal_text_pattern_compare will throw an error for this, but I don't believe it actually needs to. My recommendation is to get rid of the run-time checks and instead put a hack like this into DefineIndex or some minion thereof: if ((opclass == TEXT_PATTERN_BTREE_CLASS_OID || opclass == VARCHAR_PATTERN_BTREE_CLASS_OID || opclass == BPCHAR_PATTERN_BTREE_CLASS_OID) && !get_collation_isdeterministic(collid)) ereport(ERROR, ...); Hard-wiring that is ugly; maybe someday we'd wish to expose "doesn't allow nondeterminism" as a more generally-available opclass property. But without some other examples that have a need for it, I think it's not worth the work to create infrastructure for that. It's not like there are no other hard-wired legacy behaviors in DefineIndex... > I notice that there is a hash opclass text_pattern_ops, which I'd > actually never heard of until now, and I don't see documented. What > would we need to do about that? Huh. I wonder why that's there --- I can't see a reason why it'd behave differently from the regular hash text_ops. Maybe we feared that someday it would need to be different? Anyway, I think we can just ignore it for this purpose. > Would it help if one created COLLATE "C" indexes instead of > text_pattern_ops? What are the tradeoffs between the two approaches? Of course, the pattern_ops opclasses long predate our COLLATE support. I suspect if we'd had COLLATE we never would have invented them. I believe the pattern_ops are equivalent to a COLLATE "C" index, both theoretically and in terms of the optimizations we know about making. There might be some minor performance difference from not having to look up the collation, but not much if we've done the collate-is-c fast paths properly. regards, tom lane
pgsql-hackers by date: