Thread: indexable and locale

indexable and locale

From
Goran Thyni
Date:
Hello again,
I thought I should start making some small contributions before 7.0.

Attached is a patch for the old problem discussed feverishly before 6.5.
What it does:
for locale-enabled servers:
    use index if last char before '%' is ascii.
for non-locale servers:
    do not use index if last char is non-ascii since it is wrong anyway.

Comments?

regards,
--
-----------------
Göran Thyni
On quiet nights you can hear Windows NT reboot!

diff -c pgsql/src/backend/optimizer/path/indxpath.c work/pgsql/src/backend/optimizer/path/indxpath.c
*** pgsql/src/backend/optimizer/path/indxpath.c    Wed Oct  6 18:33:57 1999
--- work/pgsql/src/backend/optimizer/path/indxpath.c    Fri Oct 15 19:54:34 1999
***************
*** 1934,1968 ****
      op = makeOper(optup->t_data->t_oid, InvalidOid, BOOLOID, 0, NULL);
      expr = make_opclause(op, leftop, (Var *) con);
      result = lcons(expr, NIL);
-
      /*
!      * In ASCII locale we say "x <= prefix\377".  This does not
!      * work for non-ASCII collation orders, and it's not really
!      * right even for ASCII.  FIX ME!
!      * Note we assume the passed prefix string is workspace with
!      * an extra byte, as created by the xxx_fixed_prefix routines above.
       */
! #ifndef USE_LOCALE
!     prefixlen = strlen(prefix);
!     prefix[prefixlen] = '\377';
!     prefix[prefixlen+1] = '\0';
!
!     optup = SearchSysCacheTuple(OPRNAME,
!                                 PointerGetDatum("<="),
!                                 ObjectIdGetDatum(datatype),
!                                 ObjectIdGetDatum(datatype),
!                                 CharGetDatum('b'));
!     if (!HeapTupleIsValid(optup))
!         elog(ERROR, "prefix_quals: no <= operator for type %u", datatype);
!     conval = (datatype == NAMEOID) ?
!         (void*) namein(prefix) : (void*) textin(prefix);
!     con = makeConst(datatype, ((datatype == NAMEOID) ? NAMEDATALEN : -1),
!                     PointerGetDatum(conval),
!                     false, false, false, false);
!     op = makeOper(optup->t_data->t_oid, InvalidOid, BOOLOID, 0, NULL);
!     expr = make_opclause(op, leftop, (Var *) con);
!     result = lappend(result, expr);
! #endif
!
      return result;
  }
--- 1934,1970 ----
      op = makeOper(optup->t_data->t_oid, InvalidOid, BOOLOID, 0, NULL);
      expr = make_opclause(op, leftop, (Var *) con);
      result = lcons(expr, NIL);
      /*
!      * If last is in ascii range make it indexable,
!      * else let it be.
!      * FIXME: find way to use locate for this to support
!      *        indexing of non-ascii characters.
       */
!     prefixlen = strlen(prefix) - 1;
!     elog(DEBUG, "XXX1 %s", prefix);
!     if ((unsigned) prefix[prefixlen] < 126)
!       {
!         prefix[prefixlen]++;
!         elog(DEBUG, "XXX2 %s", prefix);
!         optup = SearchSysCacheTuple(OPRNAME,
!                     PointerGetDatum("<="),
!                     ObjectIdGetDatum(datatype),
!                     ObjectIdGetDatum(datatype),
!                     CharGetDatum('b'));
!         if (!HeapTupleIsValid(optup))
!           elog(ERROR, "prefix_quals: no <= operator for type %u", datatype);
!         conval = (datatype == NAMEOID) ?
!           (void*) namein(prefix) : (void*) textin(prefix);
!         con = makeConst(datatype, ((datatype == NAMEOID) ? NAMEDATALEN : -1),
!                 PointerGetDatum(conval),
!                 false, false, false, false);
!         op = makeOper(optup->t_data->t_oid, InvalidOid, BOOLOID, 0, NULL);
!         expr = make_opclause(op, leftop, (Var *) con);
!         result = lappend(result, expr);
!       }
      return result;
  }
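
As a rough illustration of the idea behind the patch (not the planner code itself), the sketch below builds the upper bound for a prefix match by incrementing the last byte when it is below 126, then tests a value against the resulting range with plain strcmp(). The helper names bump_ascii_prefix and in_prefix_range are made up for this example, and byte-wise comparison is an assumption that only holds for C-locale ordering.

```c
#include <string.h>

/*
 * Hypothetical helper (not from the patch): bump the last byte of a
 * writable prefix so it can serve as an exclusive upper bound.
 * Returns 1 if the bound was built, 0 if the prefix was left alone.
 */
static int bump_ascii_prefix(char *prefix)
{
    size_t len = strlen(prefix);

    if (len == 0)
        return 0;               /* nothing to increment */
    if ((unsigned char) prefix[len - 1] >= 126)
        return 0;               /* '~', DEL, or a non-ASCII byte */
    prefix[len - 1]++;
    return 1;
}

/* Byte-order range test standing in for "x >= lower AND x < upper". */
static int in_prefix_range(const char *x, const char *lower, const char *upper)
{
    return strcmp(x, lower) >= 0 && strcmp(x, upper) < 0;
}
```

For a pattern such as 'abc%', the lower bound is "abc" and the bumped copy is "abd". Unlike the patch, which looks up the "<=" operator for the incremented prefix, this sketch uses a strict "<" so that "abd" itself falls outside the range. It also guards against an empty prefix, which the patch does not check for.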


Re: [HACKERS] indexable and locale

From
Tatsuo Ishii
Date:
> Hello again,
> I thought I should start making some small contributions before 7.0.
>
> Attached is a patch for the old problem discussed feverishly before 6.5.
> What it does:
> for locale-enabled servers:
>     use index if last char before '%' is ascii.
> for non-locale servers:
>     do not use index if last char is non-ascii since it is wrong anyway.
> 
> Comments?         

I tried your patch but it seems malformed:
patch: **** unexpected end of file in patch

So this is a guess from reading it. I think your patch breaks
non-ASCII multi-byte character set data and should be surrounded by
#ifdef LOCALE rather than replacing the current code surrounded by
#ifndef LOCALE.
---
Tatsuo Ishii



Re: [HACKERS] indexable and locale

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Attached is a patch for the old problem discussed feverishly before 6.5.

> ... I think your patch breaks
> non-ASCII multi-byte character set data and should be surrounded by
> #ifdef LOCALE rather than replacing the current code surrounded by
> #ifndef LOCALE.

I am worried about this patch too.  Under MULTIBYTE could it
generate invalid characters?  Also, do all non-ASCII locales sort
codes 0-126 in the same order as ASCII?  I didn't think they do,
but I'm not an expert.

The approach I was considering for fixing the problem was to use a
loop that would repeatedly try to generate a string greater than the
prefix string.  The basic loop step would increment the rightmost
byte as Goran has done (or, if it's already up to the limit, chop
it off and increment the next character position).  Then test to
see whether the '<' operator actually believes the result is
greater than the given prefix, and repeat if not.  This avoids making
any strong assumptions about the sort order of different character
codes.  However, there are two significant issues that would have
to be surmounted to make it work reliably:

1. In MULTIBYTE mode incrementing the rightmost byte might yield
an illegal multibyte character.  Some way to prevent or detect this
would be needed, lest it confuse the comparison operator.  I think
we have some multibyte routines that could be used to check for
a valid result, but I haven't looked into it.

2. I think there are some locales out there that have context-
sensitive sorting rules, ie, a given character string may sort
differently than you'd expect from considering the characters in
isolation.  For example, in German isn't "ss" treated specially?
If "pqrss" does not sort between "pqrs" and "pqrt" then the entire
premise of *both* sides of the LIKE optimization falls apart,
because you can't be sure what will happen when comparing a prefix
string like "pqrs" against longer strings from the database.
I do not know if this is really a problem, nor what we could do
to avoid it if it is.
        regards, tom lane
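
The loop described above might look roughly like the following. This is only a sketch under stated assumptions: strcoll() stands in for the backend's '<' operator, the helper name next_greater_string is hypothetical, and nothing here addresses the two issues just listed (multibyte validity and context-sensitive collation).

```c
#include <limits.h>
#include <string.h>

/*
 * Hypothetical helper: try to turn `work` (a writable copy of the
 * prefix) into a string that strcoll() considers greater than `prefix`.
 * Returns 1 on success, 0 if no such string could be produced.
 */
static int next_greater_string(const char *prefix, char *work)
{
    size_t len = strlen(work);

    while (len > 0)
    {
        unsigned char *last = (unsigned char *) &work[len - 1];

        /* Increment the rightmost byte and ask the comparison each time. */
        while (*last < UCHAR_MAX)
        {
            (*last)++;
            if (strcoll(work, prefix) > 0)
                return 1;
        }

        /* Byte is at its limit: chop it off and move one position left. */
        work[--len] = '\0';
    }
    return 0;
}
```

In the default "C" locale the first increment already succeeds, so "abc" becomes "abd". The inner loop only matters when the collation order disagrees with byte order, and in a multibyte encoding the intermediate strings it tries are not guaranteed to be valid characters.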


Re: [HACKERS] indexable and locale

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >> Attached is a patch for the old problem discussed feverishly before 6.5.
>
> > ... I think your patch breaks
> > non-ASCII multi-byte character set data and should be surrounded by
> > #ifdef LOCALE rather than replacing the current code surrounded by
> > #ifndef LOCALE.
> 
> I am worried about this patch too.  Under MULTIBYTE could it
> generate invalid characters?

I assume you are talking about the following code fragment in the patch:
    prefix[prefixlen]++;

This would not generate invalid characters under MULTIBYTE since it skips
multi-byte characters with:
    if ((unsigned) prefix[prefixlen] < 126)

This would not make non-ASCII multi-byte characters indexable,
however.
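
To make the reasoning about that test concrete, here is a small sketch of how the cast behaves for an ASCII byte and for a byte with the high bit set. The function names are made up for illustration. The first mirrors the quoted expression, and the second is an equivalent spelling that does not depend on whether plain char is signed.

```c
/* Mirrors the quoted test: a plain cast of char to unsigned int. */
static int passes_patch_test(char c)
{
    /*
     * If char is signed, a high-bit byte is negative and converts to a
     * very large unsigned value; if char is unsigned it is 128..255.
     * Either way the comparison fails for non-ASCII bytes.
     */
    return (unsigned) c < 126;
}

/* Same test written with an explicit unsigned char conversion. */
static int passes_byte_test(char c)
{
    return (unsigned char) c < 126;
}
```

Both versions accept 'a' and reject a byte such as 0xE9, which is why the increment is only applied when the last byte of the prefix is plain ASCII. Note that '~' (126) is also rejected, since incrementing it would yield DEL.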

> Also, do all non-ASCII locales sort
> codes 0-126 in the same order as ASCII?  I didn't think they do,
> but I'm not an expert.

As far as I know they do. At least, all encodings that MULTIBYTE mode
can handle have the same code points as ASCII in the 0-126 range. They
have the following characteristics:

o code points 0x00-0x7f are compatible with ASCII.

o code points 0x80 and above belong to variable-length multi-byte
  characters. For example, ISO-8859-1 (German, French, etc.) always has
  a multi-byte length of 1, while EUC_JP (Japanese) has 2 to 3.
 

> The approach I was considering for fixing the problem was to use a
> loop that would repeatedly try to generate a string greater than the
> prefix string.  The basic loop step would increment the rightmost
> byte as Goran has done (or, if it's already up to the limit, chop
> it off and increment the next character position).  Then test to
> see whether the '<' operator actually believes the result is
> greater than the given prefix, and repeat if not.  This avoids making
> any strong assumptions about the sort order of different character
> codes.  However, there are two significant issues that would have
> to be surmounted to make it work reliably:

Sounds like a good idea.

> 1. In MULTIBYTE mode incrementing the rightmost byte might yield
> an illegal multibyte character.  Some way to prevent or detect this
> would be needed, lest it confuse the comparison operator.  I think
> we have some multibyte routines that could be used to check for
> a valid result, but I haven't looked into it.

I don't think this is an issue as long as locale isn't enabled. For
multibyte encodings (Japanese, Chinese, etc.) locale is totally
useless and I usually don't enable it.

> 2. I think there are some locales out there that have context-
> sensitive sorting rules, ie, a given character string may sort
> differently than you'd expect from considering the characters in
> isolation.  For example, in German isn't "ss" treated specially?
> If "pqrss" does not sort between "pqrs" and "pqrt" then the entire
> premise of *both* sides of the LIKE optimization falls apart,
> because you can't be sure what will happen when comparing a prefix
> string like "pqrs" against longer strings from the database.
> I do not know if this is really a problem, nor what we could do
> to avoid it if it is.

I'm not sure about it, but I am afraid it could be a problem. I think
the real solution would be to support the standard CREATE COLLATION.
---
Tatsuo Ishii



Re: [HACKERS] indexable and locale

From
Bruce Momjian
Date:
> > Hello again,
> > I thought I should start making some small contributions before 7.0.
> >
> > Attached is a patch for the old problem discussed feverishly before 6.5.
> > What it does:
> > for locale-enabled servers:
> >     use index if last char before '%' is ascii.
> > for non-locale servers:
> >     do not use index if last char is non-ascii since it is wrong anyway.
> > 
> > Comments?         
> 
> I tried your patch but it seems malformed:
> 
>     patch: **** unexpected end of file in patch


Yes, I had to apply it manually.

> So this is a guess from reading it. I think your patch breaks
> non-ASCII multi-byte character set data and should be surrounded by
> #ifdef LOCALE rather than replacing the current code surrounded by
> #ifndef LOCALE.

Can you supply a patch against the current tree?  I don't understand
this. Thanks.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: indexable and locale

From
Bruce Momjian
Date:
Applied.




 


Re: indexable and locale

From
Bruce Momjian
Date:
Sorry, found messages of people objecting to the patch.  Patch reversed
out.




 


Re: [HACKERS] indexable and locale

From
Bruce Momjian
Date:
Here is Tom's comment on the patch.

> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >> Attached is a patch for the old problem discussed feverishly before 6.5.
>
> > ... I think your patch breaks
> > non-ASCII multi-byte character set data and should be surrounded by
> > #ifdef LOCALE rather than replacing the current code surrounded by
> > #ifndef LOCALE.
> 
> I am worried about this patch too.  Under MULTIBYTE could it
> generate invalid characters?  Also, do all non-ASCII locales sort
> codes 0-126 in the same order as ASCII?  I didn't think they do,
> but I'm not an expert.
> 
> The approach I was considering for fixing the problem was to use a
> loop that would repeatedly try to generate a string greater than the
> prefix string.  The basic loop step would increment the rightmost
> byte as Goran has done (or, if it's already up to the limit, chop
> it off and increment the next character position).  Then test to
> see whether the '<' operator actually believes the result is
> greater than the given prefix, and repeat if not.  This avoids making
> any strong assumptions about the sort order of different character
> codes.  However, there are two significant issues that would have
> to be surmounted to make it work reliably:
> 
> 1. In MULTIBYTE mode incrementing the rightmost byte might yield
> an illegal multibyte character.  Some way to prevent or detect this
> would be needed, lest it confuse the comparison operator.  I think
> we have some multibyte routines that could be used to check for
> a valid result, but I haven't looked into it.
> 
> 2. I think there are some locales out there that have context-
> sensitive sorting rules, ie, a given character string may sort
> differently than you'd expect from considering the characters in
> isolation.  For example, in German isn't "ss" treated specially?
> If "pqrss" does not sort between "pqrs" and "pqrt" then the entire
> premise of *both* sides of the LIKE optimization falls apart,
> because you can't be sure what will happen when comparing a prefix
> string like "pqrs" against longer strings from the database.
> I do not know if this is really a problem, nor what we could do
> to avoid it if it is.
> 
>             regards, tom lane
> 
> ************
> 

