Thread: Nondeterministic collations vs. text_pattern_ops

Nondeterministic collations vs. text_pattern_ops

From
Tom Lane
Date:
Whilst poking at the leakproofness-of-texteq issue, I realized
that there's an independent problem caused by the nondeterminism
patch.  To wit, that the text_pattern_ops btree opclass uses
texteq as its equality operator, even though that operator is
no longer guaranteed to be bitwise equality.  That means that
depending on which collation happens to get attached to the
operator, equality might be inconsistent with the other members
of the opclass, leading to who-knows-what bad results.

bpchar_pattern_ops has the same issue with respect to bpchareq.

The obvious fix for this is to invent separate new equality operators,
but that's actually rather disastrous for performance, because
text_pattern_ops indexes would no longer be able to use WHERE clauses
using plain equality.  That also feeds into whether equality clauses
deduced from equivalence classes will work for them (nope, not any
more).  People using such indexes are just about certain to be
bitterly unhappy.

We may not have any choice but to do that, though --- I sure don't
see any other easy fix.  If we could be certain that the collation
attached to the operator is deterministic, then it would still work
with a pattern_ops index, but that's not a concept that the index
infrastructure has got right now.

Whatever we do about this is likely to require a catversion bump,
meaning we've got to fix it *now*.

            regards, tom lane



Re: Nondeterministic collations vs. text_pattern_ops

From
Peter Eisentraut
Date:
On 2019-09-17 01:13, Tom Lane wrote:
> Whilst poking at the leakproofness-of-texteq issue, I realized
> that there's an independent problem caused by the nondeterminism
> patch.  To wit, that the text_pattern_ops btree opclass uses
> texteq as its equality operator, even though that operator is
> no longer guaranteed to be bitwise equality.  That means that
> depending on which collation happens to get attached to the
> operator, equality might be inconsistent with the other members
> of the opclass, leading to who-knows-what bad results.

You can't create a text_pattern_ops index on a column with
nondeterministic collation:

create collation c1 (provider = icu, locale = 'und', deterministic = false);
create table t1 (a int, b text collate c1);
create index on t1 (b text_pattern_ops);
ERROR:  nondeterministic collations are not supported for operator class
"text_pattern_ops"

There is some discussion in internal_text_pattern_compare().

Are there other cases we need to consider?

I notice that there is a hash opclass text_pattern_ops, which I'd
actually never heard of until now, and I don't see documented.  What
would we need to do about that?

> The obvious fix for this is to invent separate new equality operators,
> but that's actually rather disastrous for performance, because
> text_pattern_ops indexes would no longer be able to use WHERE clauses
> using plain equality.  That also feeds into whether equality clauses
> deduced from equivalence classes will work for them (nope, not any
> more).  People using such indexes are just about certain to be
> bitterly unhappy.

Would it help if one created COLLATE "C" indexes instead of
text_pattern_ops?  What are the tradeoffs between the two approaches?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Nondeterministic collations vs. text_pattern_ops

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 2019-09-17 01:13, Tom Lane wrote:
>> Whilst poking at the leakproofness-of-texteq issue, I realized
>> that there's an independent problem caused by the nondeterminism
>> patch.  To wit, that the text_pattern_ops btree opclass uses
>> texteq as its equality operator, even though that operator is
>> no longer guaranteed to be bitwise equality.  That means that
>> depending on which collation happens to get attached to the
>> operator, equality might be inconsistent with the other members
>> of the opclass, leading to who-knows-what bad results.

> You can't create a text_pattern_ops index on a column with
> nondeterministic collation:

> create collation c1 (provider = icu, locale = 'und', deterministic = false);
> create table t1 (a int, b text collate c1);
> create index on t1 (b text_pattern_ops);
> ERROR:  nondeterministic collations are not supported for operator class
> "text_pattern_ops"

Oh!  I'd seen that error message, but not realized that it'd get
thrown during index creation.

> There is some discussion in internal_text_pattern_compare().

I don't much like doing it that way: looking up the determinism property
of the collation over again in every single comparison seems pretty
expensive, plus the operator is way exceeding its actual knowledge
of the call context by throwing an error phrased that way.

> Are there other cases we need to consider?

I think that disallowing indexes with this combination of opclass and
collation may actually be sufficient.  A query can request equality
using any collation, but it won't get matched to an index with a
different collation, so I think we're safe against index-related
problems if we have that restriction.

AFAIR, the only other place in the system where non-default opclasses
can be invoked is ORDER BY.  Somebody could write

    ORDER BY textcol COLLATE "nondeterm" USING ~<~

However, I think we're actually okay on that one, because although
the equality opclass member is out of sync with the rest, it won't
get consulted during a sort.  internal_text_pattern_compare will
throw an error for this, but I don't believe it actually needs to.

My recommendation is to get rid of the run-time checks and instead
put a hack like this into DefineIndex or some minion thereof:

    if ((opclass == TEXT_PATTERN_BTREE_CLASS_OID ||
         opclass == VARCHAR_PATTERN_BTREE_CLASS_OID ||
         opclass == BPCHAR_PATTERN_BTREE_CLASS_OID) &&
        !get_collation_isdeterministic(collid))
       ereport(ERROR, ...);

Hard-wiring that is ugly; maybe someday we'd wish to expose "doesn't
allow nondeterminism" as a more generally-available opclass property.
But without some other examples that have a need for it, I think it's
not worth the work to create infrastructure for that.  It's not like
there are no other hard-wired legacy behaviors in DefineIndex...

> I notice that there is a hash opclass text_pattern_ops, which I'd
> actually never heard of until now, and I don't see documented.  What
> would we need to do about that?

Huh.  I wonder why that's there --- I can't see a reason why it'd
behave differently from the regular hash text_ops.  Maybe we feared
that someday it would need to be different?  Anyway, I think we can
just ignore it for this purpose.

> Would it help if one created COLLATE "C" indexes instead of
> text_pattern_ops?  What are the tradeoffs between the two approaches?

Of course, the pattern_ops opclasses long predate our COLLATE support.
I suspect if we'd had COLLATE we never would have invented them.
I believe the pattern_ops are equivalent to a COLLATE "C" index, both
theoretically and in terms of the optimizations we know about making.
There might be some minor performance difference from not having to
look up the collation, but not much if we've done the collate-is-c
fast paths properly.

            regards, tom lane



Re: Nondeterministic collations vs. text_pattern_ops

From
Peter Eisentraut
Date:
On 2019-09-17 17:17, Tom Lane wrote:
> My recommendation is to get rid of the run-time checks and instead
> put a hack like this into DefineIndex or some minion thereof:
> 
>     if ((opclass == TEXT_PATTERN_BTREE_CLASS_OID ||
>          opclass == VARCHAR_PATTERN_BTREE_CLASS_OID ||
>          opclass == BPCHAR_PATTERN_BTREE_CLASS_OID) &&
>         !get_collation_isdeterministic(collid))
>        ereport(ERROR, ...);

Here is a draft patch.

It will require a catversion change because those operator classes don't
have assigned OIDs so far.

The comment block I just moved over for the time being.  It should
probably be rephrased a bit.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Nondeterministic collations vs. text_pattern_ops

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Here is a draft patch.

> It will require a catversion change because those operator classes don't
> have assigned OIDs so far.

That's slightly annoying given where we are with v12.  We could
avoid it by looking up the opclass's opfamily and seeing if it's
TEXT_BTREE_FAM_OID etc, which do already have hand-assigned OIDs.
But maybe avoiding a catversion bump now is not worth the cost of
an extra syscache lookup.  (It'd give me an excuse to shove the
leakproofness-marking changes from the other thread into v12, so
there's that.)

Speaking of extra syscache lookups, I don't like that you rearranged
the if-test to check nondeterminism before the opclass identity checks.
That's usually going to be a wasted lookup.

> The comment block I just moved over for the time being.  It should
> probably be rephrased a bit.

Indeed.  Maybe like

     * text_pattern_ops uses text_eq as the equality operator, which is
     * fine as long as the collation is deterministic; text_eq then
     * reduces to bitwise equality and so it is semantically compatible
     * with the other operators and functions in the opclass.  But with a
     * nondeterministic collation, text_eq could yield results that are
     * incompatible with the actual behavior of the index (which is
     * determined by the opclass's comparison function).  We prevent
     * such problems by refusing creation of an index with this opclass
     * and a nondeterministic collation.
     *
     * The same applies to varchar_pattern_ops and bpchar_pattern_ops.
     * If we find more cases, we might decide to create a real mechanism
     * for marking opclasses as incompatible with nondeterminism; but
     * for now, this small hack suffices.
     *
     * Another solution is to use a special operator, not text_eq, as the
     * equality opclass member, but that is undesirable because it would
     * prevent index usage in many queries that work fine today.

            regards, tom lane



Re: Nondeterministic collations vs. text_pattern_ops

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Here is a draft patch.

Where are we on pushing that?  I'm starting to get antsy about the
amount of time remaining before rc1.  It's a low-risk fix, but still,
it'd be best to have a complete buildfarm cycle on it before Monday's
wrap.

            regards, tom lane



Re: Nondeterministic collations vs. text_pattern_ops

From
Tom Lane
Date:
I wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> Here is a draft patch.

> Where are we on pushing that?  I'm starting to get antsy about the
> amount of time remaining before rc1.  It's a low-risk fix, but still,
> it'd be best to have a complete buildfarm cycle on it before Monday's
> wrap.

Since time is now really running short, I went ahead and pushed
this, after doing a closer review and finding one nitpicky bug.

            regards, tom lane



Re: Nondeterministic collations vs. text_pattern_ops

From
Peter Eisentraut
Date:
On 2019-09-21 22:30, Tom Lane wrote:
> I wrote:
>> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>>> Here is a draft patch.
> 
>> Where are we on pushing that?  I'm starting to get antsy about the
>> amount of time remaining before rc1.  It's a low-risk fix, but still,
>> it'd be best to have a complete buildfarm cycle on it before Monday's
>> wrap.
> 
> Since time is now really running short, I went ahead and pushed
> this, after doing a closer review and finding one nitpicky bug.

Great!  I think that covers all the open issues then.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services