Thread: Support LIKE with nondeterministic collations

Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
This patch adds support for using LIKE with nondeterministic collations.
So you can do things such as

     col LIKE 'foo%' COLLATE case_insensitive

This currently results in a "not supported" error.  The reason for that
is that when I first developed support for nondeterministic collations,
I didn't know what the semantics of that should be, especially since
with nondeterministic collations, strings of different lengths could be
equal, so I dropped the issue for a while.

After further research, the SQL standard's definition of the LIKE
predicate actually provides a clear definition of the semantics: The
pattern is partitioned into substrings at wildcard characters (so
'foo%bar' is partitioned into 'foo', '%', 'bar'), and then the whole
predicate matches if a match can be found for each partition under the
applicable collation (so for 'foo%bar' we look to partition the input
string into s1 || s2 || s3 such that s1 = 'foo', s2 is anything, and s3
= 'bar').  The only difference from deterministic collations is that for
deterministic collations we can optimize this by matching by character,
but for nondeterministic collations we have to go by substring.
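
As a concrete sketch (using the case_insensitive example collation from
the documentation; the sample string is made up for illustration):

     CREATE COLLATION case_insensitive (
         provider = 'icu', locale = 'und-u-ks-level2', deterministic = false
     );

     -- 'FOO.BAR' partitions into 'FOO' || '.' || 'BAR', with
     -- 'FOO' = 'foo' and 'BAR' = 'bar' under case_insensitive,
     -- so with the patch this should return true:
     SELECT 'FOO.BAR' LIKE 'foo%bar' COLLATE case_insensitive;
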
Attachment

Re: Support LIKE with nondeterministic collations

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

> This patch adds support for using LIKE with nondeterministic
>  collations.  So you can do things such as
>
>     col LIKE 'foo%' COLLATE case_insensitive

Nice!

> The pattern is partitioned into substrings at wildcard characters
> (so 'foo%bar' is partitioned into 'foo', '%', 'bar'), and then the
> whole predicate matches if a match can be found for each partition
> under the applicable collation

Trying with a collation that ignores punctuation:

  postgres=# CREATE COLLATION "ign_punct" (
    provider = 'icu',
    locale='und-u-ka-shifted',
    deterministic = false
  );

  postgres=# SELECT '.foo.' like 'foo' COLLATE ign_punct;
   ?column?
  ----------
   t
  (1 row)

  postgres=# SELECT '.foo.' like 'f_o' COLLATE ign_punct;
   ?column?
  ----------
   t
  (1 row)

  postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
   ?column?
  ----------
   f
  (1 row)

The first two results look fine, but the next one is inconsistent.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
On 30.04.24 14:39, Daniel Verite wrote:
>    postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
>     ?column?
>    ----------
>     f
>    (1 row)
> 
> The first two results look fine, but the next one is inconsistent.

This is correct, because '_' means "any single character".  This is 
independent of the collation.

I think with nondeterministic collations, the single-character wildcard 
is often not going to be all that useful.




Re: Support LIKE with nondeterministic collations

From
Robert Haas
Date:
On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 30.04.24 14:39, Daniel Verite wrote:
> >    postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
> >     ?column?
> >    ----------
> >     f
> >    (1 row)
> >
> > The first two results look fine, but the next one is inconsistent.
>
> This is correct, because '_' means "any single character".  This is
> independent of the collation.

Seems really counterintuitive. I had to think for a long time to be
able to guess what was happening here. Finally I came up with this
guess:

If the collation-aware matching tries to match up f with the initial
period, the period is skipped and the f matches f. But when the
wildcard is matched to the initial period, that uses up the wildcard
and then we're left trying to match o with f, which doesn't work.

Is that right?

It'd probably be good to use something like this as an example in the
documentation. My intuition is that if foo matches a string, then _oo,
f_o, and fo_ should also match that string. Apparently that's not the
case, but I doubt I'll be the last one who thinks it should be.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
On 03.05.24 02:11, Robert Haas wrote:
> On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> On 30.04.24 14:39, Daniel Verite wrote:
>>>     postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
>>>      ?column?
>>>     ----------
>>>      f
>>>     (1 row)
>>>
>>> The first two results look fine, but the next one is inconsistent.
>>
>> This is correct, because '_' means "any single character".  This is
>> independent of the collation.
> 
> Seems really counterintuitive. I had to think for a long time to be
> able to guess what was happening here. Finally I came up with this
> guess:
> 
> If the collation-aware matching tries to match up f with the initial
> period, the period is skipped and the f matches f. But when the
> wildcard is matched to the initial period, that uses up the wildcard
> and then we're left trying to match o with f, which doesn't work.

Formally, what

     X like '_oo'

means is, can X be partitioned into substrings such that the first 
substring is a single character and the second substring is equal to 
'oo' under the applicable collation?  This is false in this case; there
is no such partitioning.

What the implementation does is, it walks through the pattern.  It sees 
'_', so it steps over one character in the input string, which is '.' 
here.  Then we have 'foo.' left to match in the input string.  Then it 
takes from the pattern the next substring up to but not including either 
a wildcard character or the end of the string, which is 'oo', and then 
it checks if a prefix of the remaining input string can be found that is 
"equal to" 'oo'.  So here it would try in turn

     ''     = 'oo' collate ign_punct ?
     'f'    = 'oo' collate ign_punct ?
     'fo'   = 'oo' collate ign_punct ?
     'foo'  = 'oo' collate ign_punct ?
     'foo.' = 'oo' collate ign_punct ?

and they all fail, so the match fails.
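
By contrast, with a '%' in that position the punctuation can be absorbed
by the collation-aware comparison, because the rest of the input only has
to be equal to 'oo' as a whole.  So, if I'm not mistaken, this variant
should match:

     SELECT '.foo.' LIKE '%oo' COLLATE ign_punct;
     -- expected true: '.foo.' = '.f' || 'oo.', and 'oo.' = 'oo' under ign_punct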

> It'd probably be good to use something like this as an example in the
> documentation. My intuition is that if foo matches a string, then _oo,
> f_o, and fo_ should also match that string. Apparently that's not the
> case, but I doubt I'll be the last one who thinks it should be.

This intuition fails because with nondeterministic collations, strings 
of different lengths can be equal, and so the question arises, what does 
the pattern '_' mean.  It could mean either (1) a single character, or
perhaps something like (2) a string that is equal to some other string
of length one.

The second definition would satisfy the expectation here, because then 
'.f' matches '_' because '.f' is equal to some string of length one, 
such as 'f'.  (And then 'oo.' matches 'oo' for the rest of the pattern.)
However, off the top of my head, this definition has three flaws: (1)
It would make the single-character wildcard effectively an 
any-number-of-characters wildcard, but only in some circumstances, which 
could be confusing, (2) it would be difficult to compute, because you'd 
have to check equality against all possible single-character strings, 
and (3) it is not what the SQL standard says.
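
The equalities that definition (2) would rely on are easy to check
directly with the ign_punct collation from upthread:

     SELECT '.f' = 'f' COLLATE ign_punct;    -- true, punctuation is ignored
     SELECT 'oo.' = 'oo' COLLATE ign_punct;  -- true as well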

In any case, yes, some explanation and examples should be added.




Re: Support LIKE with nondeterministic collations

From
Robert Haas
Date:
On Fri, May 3, 2024 at 4:52 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> What the implementation does is, it walks through the pattern.  It sees
> '_', so it steps over one character in the input string, which is '.'
> here.  Then we have 'foo.' left to match in the input string.  Then it
> takes from the pattern the next substring up to but not including either
> a wildcard character or the end of the string, which is 'oo', and then
> it checks if a prefix of the remaining input string can be found that is
> "equal to" 'oo'.  So here it would try in turn
>
>      ''     = 'oo' collate ign_punct ?
>      'f'    = 'oo' collate ign_punct ?
>      'fo'   = 'oo' collate ign_punct ?
>      'foo'  = 'oo' collate ign_punct ?
>      'foo.' = 'oo' collate ign_punct ?
>
> and they all fail, so the match fails.

Interesting. Does that imply that these matches are slower than normal ones?

> The second definition would satisfy the expectation here, because then
> '.f' matches '_' because '.f' is equal to some string of length one,
> such as 'f'.  (And then 'oo.' matches 'oo' for the rest of the pattern.)
>   However, off the top of my head, this definition has three flaws: (1)
> It would make the single-character wildcard effectively an
> any-number-of-characters wildcard, but only in some circumstances, which
> could be confusing, (2) it would be difficult to compute, because you'd
> have to check equality against all possible single-character strings,
> and (3) it is not what the SQL standard says.

Right, those are good arguments.

> In any case, yes, some explanation and examples should be added.

Cool.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
On 03.05.24 15:20, Robert Haas wrote:
> On Fri, May 3, 2024 at 4:52 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> What the implementation does is, it walks through the pattern.  It sees
>> '_', so it steps over one character in the input string, which is '.'
>> here.  Then we have 'foo.' left to match in the input string.  Then it
>> takes from the pattern the next substring up to but not including either
>> a wildcard character or the end of the string, which is 'oo', and then
>> it checks if a prefix of the remaining input string can be found that is
>> "equal to" 'oo'.  So here it would try in turn
>>
>>       ''     = 'oo' collate ign_punct ?
>>       'f'    = 'oo' collate ign_punct ?
>>       'fo'   = 'oo' collate ign_punct ?
>>       'foo'  = 'oo' collate ign_punct ?
>>       'foo.' = 'oo' collate ign_punct ?
>>
>> and they all fail, so the match fails.
> 
> Interesting. Does that imply that these matches are slower than normal ones?

Yes, certainly, and there is also no indexing support (other than for 
exact matches).
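
Roughly, with made-up table and index names:

     CREATE TABLE names (n text COLLATE case_insensitive);
     CREATE INDEX ON names (n);

     SELECT * FROM names WHERE n = 'Smith';    -- equality can use the btree index
     SELECT * FROM names WHERE n LIKE 'Smi%';  -- no index support for this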




Re: Support LIKE with nondeterministic collations

From
"Daniel Verite"
Date:
Peter Eisentraut wrote:

> Yes, certainly, and there is also no indexing support (other than for
> exact matches).

The ICU docs have this note about prefix matching:


https://unicode-org.github.io/icu/userguide/collation/architecture.html#generating-bounds-for-a-sort-key-prefix-matching

   * Generating bounds for a sort key (prefix matching)

   Having sort keys for strings allows for easy creation of bounds -
   sort keys that are guaranteed to be smaller or larger than any sort
   key from a give range. For example, if bounds are produced for a
   sortkey of string “smith”, strings between upper and lower bounds
   with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
   of upper bounds can be generated - the first one will match only
   strings of equal length, while the second one will match all the
   strings with the same initial prefix.

   CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
   the maximum primary weight, so that for example the string
   “smith\uFFFF” can be used as the upper bound rather than modifying
   the sort key for “smith”.

In other words it says that

  col LIKE 'smith%' collate "nd"

is equivalent to:

  col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"

which could be obtained from an index scan, assuming a btree
index on "col" collate "nd".

U+FFFF is a valid code point but a "non-character" [1] so it's
not supposed to be present in normal strings.

[1] https://www.unicode.org/glossary/#noncharacter


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Support LIKE with nondeterministic collations

From
"Daniel Verite"
Date:
    Peter Eisentraut wrote:

>  However, off the top of my head, this definition has three flaws: (1)
> It would make the single-character wildcard effectively an
> any-number-of-characters wildcard, but only in some circumstances, which
> could be confusing, (2) it would be difficult to compute, because you'd
> have to check equality against all possible single-character strings,
> and (3) it is not what the SQL standard says.

For #1 we're currently using the definition of a "character" as
being any single code point, but this definition fits poorly
with non-deterministic collation rules.

The simplest illustration I can think of is the canonical
equivalence match between the NFD and NFC forms of an
accented character.

postgres=# CREATE COLLATION nd (
  provider = 'icu',
  locale = 'und',
  deterministic = false
);

-- match NFD form with NFC form of eacute

postgres=# select U&'e\0301' like 'é' collate nd;
 ?column?
----------
 t

postgres=# select U&'e\0301' like '_' collate nd;
 ?column?
----------
 f
(1 row)

I understand why the algorithm produces these opposite results.
But at the semantic level, when asked if the left-hand string matches
a specific character, it says yes, and when asked if it matches any
character, it says no.
To me it goes beyond counter-intuitive; it's not reasonable enough to
be called correct.
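
To make that concrete, the left-hand string is two code points forming a
single grapheme, while the right-hand side is one code point:

  SELECT length(U&'e\0301') AS nfd_len, length(U&'\00E9') AS nfc_len;
  -- 2 and 1: the '_' wildcard counts code points, not graphemes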

What could we do about it?
Intuitively, I think that our interpretation of "character" here should
be whatever sequence of code points lies between character
boundaries [1], and that the equality of such characters would be the
equality of their sequences of code points, using the string equality
check of the collation, whatever the length of these sequences.

[1]:
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
On 03.05.24 16:58, Daniel Verite wrote:
>     * Generating bounds for a sort key (prefix matching)
> 
>     Having sort keys for strings allows for easy creation of bounds -
>     sort keys that are guaranteed to be smaller or larger than any sort
>     key from a give range. For example, if bounds are produced for a
>     sortkey of string “smith”, strings between upper and lower bounds
>     with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
>     of upper bounds can be generated - the first one will match only
>     strings of equal length, while the second one will match all the
>     strings with the same initial prefix.
> 
>     CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
>     the maximum primary weight, so that for example the string
>     “smith\uFFFF” can be used as the upper bound rather than modifying
>     the sort key for “smith”.
> 
> In other words it says that
> 
>    col LIKE 'smith%' collate "nd"
> 
> is equivalent to:
> 
>    col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
> 
> which could be obtained from an index scan, assuming a btree
> index on "col" collate "nd".
> 
> U+FFFF is a valid code point but a "non-character" [1] so it's
> not supposed to be present in normal strings.

Thanks, this could be very useful!




Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
On 03.05.24 17:47, Daniel Verite wrote:
>     Peter Eisentraut wrote:
> 
>>   However, off the top of my head, this definition has three flaws: (1)
>> It would make the single-character wildcard effectively an
>> any-number-of-characters wildcard, but only in some circumstances, which
>> could be confusing, (2) it would be difficult to compute, because you'd
>> have to check equality against all possible single-character strings,
>> and (3) it is not what the SQL standard says.
> 
> For #1 we're currently using the definition of a "character" as
> being any single code point,

That is the definition that is used throughout SQL and PostgreSQL.  We
can't change that without redefining everything.  To pick just one
example, the various trim functions also behave in seemingly inconsistent
ways when you apply them to strings in different normalization forms.
The better fix there is to enforce the normalization form somehow.
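
For example (roughly the kind of thing I mean, using ltrim() and
normalize()):

  SELECT ltrim(U&'\00E9x', U&'\00E9');                   -- 'x'
  SELECT ltrim(U&'e\0301x', U&'\00E9');                  -- trims nothing: the leading 'e' is not in the set
  SELECT ltrim(normalize(U&'e\0301x', NFC), U&'\00E9');  -- 'x' again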

> Intuitively I think that our interpretation of "character" here should
> be whatever sequence of code points are between character
> boundaries [1], and that the equality of such characters would be the
> equality of their sequences of code points, with the string equality
> check of the collation, whatever the length of these sequences.
> 
> [1]:
> https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Even that page says that what we are calling a character here is really
called a grapheme cluster.

In a different world, pattern matching, character trimming, etc. would 
work by grapheme, but it does not.




Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
Here is an updated patch for this.

I have added some more documentation based on the discussions, including 
some examples taken directly from the emails here.

One thing I have been struggling with a bit is the correct use of 
LIKE_FALSE versus LIKE_ABORT in the MatchText() code.  I have made some 
small tweaks about this in this version that I think are more correct, 
but it could use another look.  Maybe also some more tests to verify 
this one way or the other.


On 30.04.24 14:39, Daniel Verite wrote:
>     Peter Eisentraut wrote:
> 
>> This patch adds support for using LIKE with nondeterministic
>>   collations.  So you can do things such as
>>
>>      col LIKE 'foo%' COLLATE case_insensitive
> 
> Nice!
> 
>> The pattern is partitioned into substrings at wildcard characters
>> (so 'foo%bar' is partitioned into 'foo', '%', 'bar'), and then the
>> whole predicate matches if a match can be found for each partition
>> under the applicable collation
> 
> Trying with a collation that ignores punctuation:
> 
>    postgres=# CREATE COLLATION "ign_punct" (
>      provider = 'icu',
>      locale='und-u-ka-shifted',
>      deterministic = false
>    );
> 
>    postgres=# SELECT '.foo.' like 'foo' COLLATE ign_punct;
>     ?column?
>    ----------
>     t
>    (1 row)
> 
>    postgres=# SELECT '.foo.' like 'f_o' COLLATE ign_punct;
>     ?column?
>    ----------
>     t
>    (1 row)
> 
>    postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
>     ?column?
>    ----------
>     f
>    (1 row)
> 
> The first two results look fine, but the next one is inconsistent.

Attachment

Re: Support LIKE with nondeterministic collations

From
Paul A Jungwirth
Date:
On Thu, Jun 27, 2024 at 11:31 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> Here is an updated patch for this.

I took a look at this. I added some tests and found a few that give
the wrong result (I believe). The new tests are included in the
attached patch, along with the results I expect. Here are the
failures:

 -- inner %% matches b then zero:
 SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
  ?column?
 ----------
- t
+ f
 (1 row)

 -- trailing _ matches two codepoints that form one char:
 SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
  ?column?
 ----------
- t
+ f
 (1 row)

-- leading % matches zero:
 SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
  ?column?
 ----------
- t
+ f
 (1 row)

 -- leading % matches zero (with later %):
 SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
  ?column?
 ----------
- t
+ f
 (1 row)

I think the 1st, 3rd, and 4th failures are all from % not backtracking
to match zero chars.

The 2nd failure I'm not sure about. Maybe my expectation is wrong, but
then why does the same test pass with __ leading not trailing? Surely
they should be consistent.

> I have added some more documentation based on the discussions, including
> some examples taken directly from the emails here.

This looks good to me.

> One thing I have been struggling with a bit is the correct use of
> LIKE_FALSE versus LIKE_ABORT in the MatchText() code.  I have made some
> small tweaks about this in this version that I think are more correct,
> but it could use another look.  Maybe also some more tests to verify
> this one way or the other.

I haven't looked at this yet.

Yours,

--
Paul              ~{:-)
pj@illuminatedcomputing.com

Attachment

Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
On 27.07.24 00:32, Paul A Jungwirth wrote:
> On Thu, Jun 27, 2024 at 11:31 PM Peter Eisentraut <peter@eisentraut.org> wrote:
>> Here is an updated patch for this.
> 
> I took a look at this. I added some tests and found a few that give
> the wrong result (I believe). The new tests are included in the
> attached patch, along with the results I expect. Here are the
> failures:

Thanks, these are great test cases.

> 
>   -- inner %% matches b then zero:
>   SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
>    ?column?
>   ----------
> - t
> + f
>   (1 row)
> 
>   -- trailing _ matches two codepoints that form one char:
>   SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
>    ?column?
>   ----------
> - t
> + f
>   (1 row)
> 
> -- leading % matches zero:
>   SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
>    ?column?
>   ----------
> - t
> + f
>   (1 row)
> 
>   -- leading % matches zero (with later %):
>   SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
>    ?column?
>   ----------
> - t
> + f
>   (1 row)
> 
> I think the 1st, 3rd, and 4th failures are all from % not backtracking
> to match zero chars.

These are all because of this in like_match.c:

        * Otherwise, scan for a text position at which we can match the
        * rest of the pattern.  The first remaining pattern char is known
        * to be a regular or escaped literal character, so we can compare
        * the first pattern byte to each text byte to avoid recursing
        * more than we have to.  [...]

This shortcut doesn't work with nondeterministic collations, so we have 
to recurse in any case here.  I have fixed that in the new patch; these 
test cases work now.

> The 2nd failure I'm not sure about. Maybe my expectation is wrong, but
> then why does the same test pass with __ leading not trailing? Surely
> they should be consistent.

The question is why is

SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;  -- false

but

SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents;  -- true

The second one matches because

     SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;  -- true

So the accent character will be ignored if it's adjacent to another 
fixed substring in the pattern.
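
For what it's worth, appending a '%' gives the leftover accent something
to be matched by (since '%' matches any sequence regardless of
collation), so the trailing case should then succeed:

     SELECT U&'cb\0061\0308' LIKE U&'cb_%' COLLATE ignore_accents;  -- expected: true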

Attachment

Re: Support LIKE with nondeterministic collations

From
Jeff Davis
Date:
On Fri, 2024-05-03 at 16:58 +0200, Daniel Verite wrote:
>    * Generating bounds for a sort key (prefix matching)
>
>    Having sort keys for strings allows for easy creation of bounds -
>    sort keys that are guaranteed to be smaller or larger than any
> sort
>    key from a give range. For example, if bounds are produced for a
>    sortkey of string “smith”, strings between upper and lower bounds
>    with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
>    of upper bounds can be generated - the first one will match only
>    strings of equal length, while the second one will match all the
>    strings with the same initial prefix.
>
>    CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
>    the maximum primary weight, so that for example the string
>    “smith\uFFFF” can be used as the upper bound rather than modifying
>    the sort key for “smith”.
>
> In other words it says that
>
>   col LIKE 'smith%' collate "nd"
>
> is equivalent to:
>
>   col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"

That logic seems to assume something about the collation. If you have a
collation that orders strings by their sha256 hash, that would entirely
break the connection between prefixes and ranges, and it wouldn't work.

Is there something about the way collations are defined that inherently
maintains a connection between a prefix and a range? Does it remain
true even when strange rules are added to a collation?

Regards,
    Jeff Davis




Re: Support LIKE with nondeterministic collations

From
"Daniel Verite"
Date:
    Jeff Davis wrote:

> >   col LIKE 'smith%' collate "nd"
> >
> > is equivalent to:
> >
> >   col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
>
> That logic seems to assume something about the collation. If you have a
> collation that orders strings by their sha256 hash, that would entirely
> break the connection between prefixes and ranges, and it wouldn't work.

Indeed I would not trust this trick to work outside of ICU collations.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: Support LIKE with nondeterministic collations

From
Jacob Champion
Date:
On Sun, Sep 15, 2024 at 11:26 PM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> Here is an updated patch.  It is rebased over the various recent changes
> in the locale APIs.  No other changes.

libfuzzer is unhappy about the following code in MatchText:

> +            while (p1len > 0)
> +            {
> +                if (*p1 == '\\')
> +                {
> +                    found_escape = true;
> +                    NextByte(p1, p1len);
> +                }
> +                else if (*p1 == '_' || *p1 == '%')
> +                    break;
> +                NextByte(p1, p1len);
> +            }

If the pattern ends with a backslash, we'll call NextByte() twice,
p1len will wrap around to INT_MAX, and we'll walk off the end of the
buffer. (I fixed it locally by duplicating the ERROR case that's
directly above this.)

So far that's the only thing reported, but fuzzing is slow. The fuzzer
is incentivized to find more and more horrible call stacks, which in
this case means it's finding inefficient patterns with a lot of
backtracking. (Performance drops from 25000+ iterations per second to
roughly 50 per second pretty quickly, and that's not fast enough to
make good progress.) I haven't dug in yet to see whether there are
optimizations that would avoid the worst cases.

Thanks,
--Jacob



Re: Support LIKE with nondeterministic collations

From
Heikki Linnakangas
Date:
On 04/11/2024 10:26, Peter Eisentraut wrote:
> On 29.10.24 18:15, Jacob Champion wrote:
>> libfuzzer is unhappy about the following code in MatchText:
>>
>>> +            while (p1len > 0)
>>> +            {
>>> +                if (*p1 == '\\')
>>> +                {
>>> +                    found_escape = true;
>>> +                    NextByte(p1, p1len);
>>> +                }
>>> +                else if (*p1 == '_' || *p1 == '%')
>>> +                    break;
>>> +                NextByte(p1, p1len);
>>> +            }
>>
>> If the pattern ends with a backslash, we'll call NextByte() twice,
>> p1len will wrap around to INT_MAX, and we'll walk off the end of the
>> buffer. (I fixed it locally by duplicating the ERROR case that's
>> directly above this.)
> 
> Thanks.  Here is an updated patch with that fixed.

Sadly the algorithm is O(n^2) with non-deterministic collations. Is there
any way this could be optimized? We make no claims on how expensive any 
functions or operators are, so I suppose a slow implementation is 
nevertheless better than throwing an error.

Let's at least add some CHECK_FOR_INTERRUPTS(). For example, this takes 
a very long time and is uninterruptible:

  SELECT repeat('x', 100000) LIKE '%xxxy%' COLLATE ignore_accents;

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Support LIKE with nondeterministic collations

From
jian he
Date:
On Tue, Nov 12, 2024 at 3:45 PM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 11.11.24 14:25, Heikki Linnakangas wrote:
> > Sadly the algorithm is O(n^2) with non-deterministic collations. Is there
> > any way this could be optimized? We make no claims on how expensive any
> > functions or operators are, so I suppose a slow implementation is
> > nevertheless better than throwing an error.
>
> Yeah, maybe someone comes up with new ideas in the future.
>

/*
 * Now build a substring of the text and try to match it against
 * the subpattern.  t is the start of the text, t1 is one past the
 * last byte.  We start with a zero-length string.
 */
t1 = t;
t1len = tlen;
for (;;)
{
    int         cmp;

    CHECK_FOR_INTERRUPTS();

    cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);

select '.foo.' LIKE '_oo' COLLATE ign_punct;

pg_strncoll's iterations over the first four argument values:

oo      2       foo. 0
oo      2       foo. 1
oo      2       foo. 2
oo      2       foo. 3
oo      2       foo. 4

It seems there is a possible shortcut/optimization: if the subpattern
doesn't contain a wildcard (percent sign, underscore), could we have
fewer pg_strncoll calls?


A minimal case to trigger the error within GenericMatchText, since there
are no related tests:

create table t1(a text collate case_insensitive, b text collate "C");
insert into t1 values ('a','a');
select a like b from t1;


In the 9.7.1 LIKE section, we still don't say what a "wildcard" is;
it is only mentioned in 9.7.2.
Maybe we can add a sentence at the end of:
    <para>
     If <replaceable>pattern</replaceable> does not contain percent
     signs or underscores, then the pattern only represents the string
     itself; in that case <function>LIKE</function> acts like the
     equals operator.  An underscore (<literal>_</literal>) in
     <replaceable>pattern</replaceable> stands for (matches) any single
     character; a percent sign (<literal>%</literal>) matches any sequence
     of zero or more characters.
    </para>

saying that underscore and percent sign are the wildcards in LIKE.
Other than that, I can understand the doc.



Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
On 15.11.24 05:26, jian he wrote:
> /*
>  * Now build a substring of the text and try to match it against
>  * the subpattern.  t is the start of the text, t1 is one past the
>  * last byte.  We start with a zero-length string.
>  */
> t1 = t;
> t1len = tlen;
> for (;;)
> {
>     int         cmp;
>
>     CHECK_FOR_INTERRUPTS();
>
>     cmp = pg_strncoll(subpat, subpatlen, t, (t1 - t), locale);
> 
> select '.foo.' LIKE '_oo' COLLATE ign_punct;
> pg_strncoll's iteration of the first 4 argument values.
> oo      2       foo. 0
> oo      2       foo. 1
> oo      2       foo. 2
> oo      2       foo. 3
> oo      2       foo. 4
> 
> seems there is a shortcut/optimization.
> if subpat don't have wildcard(percent sign, underscore)
> then we can have less pg_strncoll calls?

How would you do that?  You need to try all combinations to find one 
that matches.

> minimum case to trigger error within GenericMatchText
> since no related tests.
> create table t1(a text collate case_insensitive, b text collate "C");
> insert into t1 values ('a','a');
> select a like b from t1;

This results in

ERROR:  42P22: could not determine which collation to use for LIKE
HINT:  Use the COLLATE clause to set the collation explicitly.

which is the expected behavior.
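
Following the hint, the query goes through once a collation is given
explicitly, for example:

SELECT a LIKE b COLLATE case_insensitive FROM t1;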

> at 9.7.1. LIKE  section, we still don't know what "wildcard" is.
> we mentioned it at 9.7.2.
> maybe we can add a sentence at the end of:
>      <para>
>       If <replaceable>pattern</replaceable> does not contain percent
>       signs or underscores, then the pattern only represents the string
>       itself; in that case <function>LIKE</function> acts like the
>       equals operator.  An underscore (<literal>_</literal>) in
>       <replaceable>pattern</replaceable> stands for (matches) any single
>       character; a percent sign (<literal>%</literal>) matches any sequence
>       of zero or more characters.
>      </para>
> 
> saying underscore and percent sign are wildcards in LIKE.
> other than that, I can understand the doc.

Ok, I agree that could be clarified.




Re: Support LIKE with nondeterministic collations

From
jian he
Date:
On Tue, Nov 19, 2024 at 9:51 PM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 18.11.24 04:30, jian he wrote:
> > we can optimize when the trailing (last) character is not a wildcard.
> >
> > SELECT 'Ha12foo' LIKE '%foo' COLLATE ignore_accents;
> > within the for loop
> > for(;;)
> > {
> > int            cmp;
> > CHECK_FOR_INTERRUPTS();
> > ....
> > }
> >
> > pg_strncoll comparison will become
> > Ha12foo    foo
> > a12foo      foo
> > 12foo        foo
> > 2foo          foo
> > foo            foo
> >
> > it's safe because in MatchText we have:
> > else if (*p == '%')
> > {
> >     while (tlen > 0)
> >     {
> >         if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
> >         {
> >             int         matched = MatchText(t, tlen, p, plen, locale);
> >
> >             if (matched != LIKE_FALSE)
> >                 return matched; /* TRUE or ABORT */
> >         }
> >         NextChar(t, tlen);
> >     }
> > }
> >
> > please check attached.
>
> I see, good idea.  I implemented it a bit differently.  See "Shortcut:
> If this is the end of the pattern ..." in this patch.  Please check if
> this is what you had in mind.

Your implementation is far simpler than mine.
I think I understand it.

I am trying to optimize the case where the pattern is a begins-with
pattern like `pattern%`, but failed on cases like:

SELECT U&'\0061\0308bc' LIKE U&'\00E4bc%' COLLATE ignore_accents;

Basically, for the_string LIKE 'the_pattern%', the lengths of the_string
and the_pattern can vary, so we cannot just do a single pg_strncoll call.


In match_pattern_prefix, maybe change
    if (expr_coll && !get_collation_isdeterministic(expr_coll))
        return NIL;
to
    if (OidIsValid(expr_coll) && !get_collation_isdeterministic(expr_coll))
        return NIL;

Other than that, I didn't find any issue.



Re: Support LIKE with nondeterministic collations

From
Peter Eisentraut
Date:
On 20.11.24 08:29, jian he wrote:
> in match_pattern_prefix maybe change
>      if (expr_coll && !get_collation_isdeterministic(expr_coll))
>          return NIL;
> to
>      if (OidIsValid(expr_coll) && !get_collation_isdeterministic(expr_coll))
>          return NIL;

I left it like it was, because this was existing code that I'm just 
moving around.

> other than that, I didn't find any issue.

I have committed it, thanks.