Thread: regexp character class locale awareness patch

regexp character class locale awareness patch

From
Manuel Sugawara
Date:
Attached is a pacth against 7.2 which adds locale awareness to the
character classes of the regular expression engine. Please consider
including this feature to postgreSQL.

Regards,
Manuel.

Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
Can someone who is multbyte-aware comment on this patch?  Thanks.

---------------------------------------------------------------------------

Manuel Sugawara wrote:
> Attached is a pacth against 7.2 which adds locale awareness to
> the character classes of the regular expression engine. Please
> consider including this feature to postgreSQL.
>
> Regards, Manuel.

Content-Description: regexp character class locale awareness patch

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
*** src/backend/regex/regcomp.c.org    Sun Mar 17 16:39:13 2002
--- src/backend/regex/regcomp.c    Sun Mar 17 16:53:43 2002
***************
*** 47,53 ****
--- 47,64 ----
  #include "regex/regex.h"
  #include "regex/utils.h"
  #include "regex/regex2.h"
+ #ifdef USE_LOCALE
+ struct cclass
+ {
+     char *name;
+     char *chars;
+     char *multis;
+ };
+ static struct cclass* cclasses = NULL;
+ static struct cclass* cclass_init(void);
+ #else
  #include "regex/cclass.h"
+ #endif /* USE_LOCALE */
  #include "regex/cname.h"

  /*
***************
*** 174,179 ****
--- 185,195 ----
      pg_wchar   *wcp;
  #endif

+ #ifdef USE_LOCALE
+     if ( cclasses == NULL )
+         cclasses = cclass_init();
+ #endif /* USE_LOCALE */
+
  #ifdef REDEBUG
  #define  GOODFLAGS(f)     (f)
  #else
***************
*** 884,890 ****
      struct cclass *cp;
      size_t        len;
      char       *u;
!     char        c;

      while (MORE() && pg_isalpha(PEEK()))
          NEXT();
--- 900,906 ----
      struct cclass *cp;
      size_t        len;
      char       *u;
!     unsigned char        c;

      while (MORE() && pg_isalpha(PEEK()))
          NEXT();
***************
*** 905,911 ****

      u = cp->chars;
      while ((c = *u++) != '\0')
!         CHadd(cs, c);
      for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)
          MCadd(p, cs, u);
  }
--- 921,927 ----

      u = cp->chars;
      while ((c = *u++) != '\0')
!         CHadd(cs, c);
      for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)
          MCadd(p, cs, u);
  }
***************
*** 1716,1718 ****
--- 1732,1796 ----
      return (islower((unsigned char) c));
  #endif
  }
+
+ #ifdef USE_LOCALE
+ static struct cclass *
+ cclass_init(void)
+ {
+     struct cclass *cp = NULL;
+     struct cclass *classes = NULL;
+     struct cclass_factory
+     {
+         char *name;
+         int (*func)(int);
+         char *chars;
+     } cclass_factories [] =
+         {
+             { "alnum", isalnum, NULL },
+             { "alpha", isalpha, NULL },
+             { "blank", NULL, " \t" },
+             { "cntrl", iscntrl, NULL },
+             { "digit", NULL, "0123456789" },
+             { "graph", isgraph, NULL },
+             { "lower", islower, NULL },
+             { "print", isprint, NULL },
+             { "punct", ispunct, NULL },
+             { "space", NULL, "\t\n\v\f\r " },
+             { "upper", isupper, NULL },
+             { "xdigit", isxdigit, NULL },
+             { NULL, NULL, NULL }
+         };
+     struct cclass_factory *cf = NULL;
+
+     classes = malloc(sizeof(struct cclass) * (sizeof(cclass_factories) / sizeof(struct cclass_factory)));
+     if (classes == NULL)
+         elog(ERROR,"cclass_init: out of memory");
+
+     cp = classes;
+     for(cf = cclass_factories; cf->name != NULL; cf++)
+         {
+             cp->name = strdup(cf->name);
+             if ( cf->chars )
+                 cp->chars = strdup(cf->chars);
+             else
+                 {
+                     int x = 0, y = 0;
+                     cp->chars = malloc(sizeof(char) * 256);
+                     if (cp->chars == NULL)
+                         elog(ERROR,"cclass_init: out of memory");
+                     for (x = 0; x < 256; x++)
+                         {
+                             if((cf->func)(x))
+                                 *(cp->chars + y++) = x;
+                         }
+                     *(cp->chars + y) = '\0';
+                 }
+             cp->multis = "";
+             cp++;
+         }
+     cp->name = cp->chars = NULL;
+     cp->multis = "";
+
+     return classes;
+ }
+ #endif /* USE_LOCALE */

Re: regexp character class locale awareness patch

From
Tatsuo Ishii
Date:
> Can someone who is multbyte-aware comment on this patch?  Thanks.

I thought the patch is not relevant to multibyte support?
--
Tatsuo Ishii


Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
Tatsuo Ishii wrote:
> > Can someone who is multbyte-aware comment on this patch?  Thanks.
> 
> I thought the patch is not relevant to multibyte support?

Sorry, yes, it is for locale.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: regexp character class locale awareness patch

From
Peter Eisentraut
Date:
Whatever you do with this patch, remember that the USE_LOCALE symbol is
gone.

Bruce Momjian writes:

>
> Can someone who is multbyte-aware comment on this patch?  Thanks.
>
> ---------------------------------------------------------------------------
>
> Manuel Sugawara wrote:
> > Attached is a pacth against 7.2 which adds locale awareness to
> > the character classes of the regular expression engine. Please
> > consider including this feature to postgreSQL.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: regexp character class locale awareness patch

From
Tatsuo Ishii
Date:
> Whatever you do with this patch, remember that the USE_LOCALE symbol is
> gone.

I thought we have some way to tern off locale support at the configure
time.
--
Tatsuo Ishii


Re: regexp character class locale awareness patch

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> > Whatever you do with this patch, remember that the USE_LOCALE symbol is
> > gone.
>
> I thought we have some way to tern off locale support at the configure
> time.

You do it at initdb time now.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: regexp character class locale awareness patch

From
Tatsuo Ishii
Date:
> Whatever you do with this patch, remember that the USE_LOCALE symbol is
> gone.

Then the patches should be modified.
--
Tatsuo Ishii


Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
Tatsuo Ishii wrote:
> > Whatever you do with this patch, remember that the USE_LOCALE symbol is
> > gone.
> 
> Then the patches should be modified.

Yes, I am not quite sure how to do that.  I will research it unless
someone else lends a hand.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: regexp character class locale awareness patch

From
Peter Eisentraut
Date:
Bruce Momjian writes:

> Tatsuo Ishii wrote:
> > > Whatever you do with this patch, remember that the USE_LOCALE symbol is
> > > gone.
> >
> > Then the patches should be modified.
>
> Yes, I am not quite sure how to do that.  I will research it unless
> someone else lends a hand.

Basically, you manually preprocess the patch to include the USE_LOCALE
branch and remove the not USE_LOCALE branch.  However, if the no-locale
branches have significant performance benefits then it might be worth
pondering setting up some optimizations.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: regexp character class locale awareness patch

From
Manuel Sugawara
Date:
According to POSIX -regex (7)-, standard character class are:
             alnum       digit       punct             alpha       graph       space             blank       lower
upper             cntrl       print       xdigi
 

Many of that classes are different in different locales, and currently
all work as if the localization were C. Many of those tests have
multibyte issues, however with the patch postgres will work for
one-byte encondings, which is better than nothing. If someone
(Tatsuo?) gives some advice I will work in the multibyte version.

Peter Eisentraut <peter_e@gmx.net> writes:
>
> Basically, you manually preprocess the patch to include the
> USE_LOCALE branch and remove the not USE_LOCALE branch.

Yeah, that should work. You may also remove include/regex/cclass.h
since it will not be used any more.

> However, if the no-locale branches have significant performance
> benefits then it might be worth pondering setting up some
> optimizations.

This is not the case.

Regards,
Manuel.


Re: regexp character class locale awareness patch

From
Tatsuo Ishii
Date:
> According to POSIX -regex (7)-, standard character class are:
> 
>               alnum       digit       punct
>               alpha       graph       space
>               blank       lower       upper
>               cntrl       print       xdigi
> 
> Many of that classes are different in different locales, and currently
> all work as if the localization were C. Many of those tests have
> multibyte issues, however with the patch postgres will work for
> one-byte encondings, which is better than nothing. If someone
> (Tatsuo?) gives some advice I will work in the multibyte version.

I don't think character classes are applicable for most mutibyte
encodings. Maybe only the exeception is Unicode?

> Peter Eisentraut <peter_e@gmx.net> writes:
> >
> > Basically, you manually preprocess the patch to include the
> > USE_LOCALE branch and remove the not USE_LOCALE branch.
> 
> Yeah, that should work. You may also remove include/regex/cclass.h
> since it will not be used any more.

But I don't like cclass_init() routine runs every time when reg_comp
called. In my understanding the result of cclass_init() is always
same. What about running cclass_init() in postmaster, not postgres? Or
even better in initdb time?
--
Tatsuo Ishii


Re: regexp character class locale awareness patch

From
Manuel Sugawara
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:

> I don't think character classes are applicable for most mutibyte
> encodings. Maybe only the exeception is Unicode?

Maybe, and is the only one I need ;-)

> 
> > Peter Eisentraut <peter_e@gmx.net> writes:
> > >
> > > Basically, you manually preprocess the patch to include the
> > > USE_LOCALE branch and remove the not USE_LOCALE branch.
> > 
> > Yeah, that should work. You may also remove include/regex/cclass.h
> > since it will not be used any more.
> 
> But I don't like cclass_init() routine runs every time when reg_comp
> called.

Actually it is called once per backend and only if it uses the regular
expression engine.

> In my understanding the result of cclass_init() is always
> same. 

Yes, if localization does not change. Karel once talked about the
possibility of being able to have different locales in the same
DB.

> What about running cclass_init() in postmaster, not postgres? Or
> even better in initdb time?

It might be, but ... I think that it would be nice if we leave the
door open to the possibility of having mixed locale configurations,
across data bases or even across columns of the same table.

Regards,
Manuel.


Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
Manuel Sugawara wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> >
> > Basically, you manually preprocess the patch to include the
> > USE_LOCALE branch and remove the not USE_LOCALE branch.
>
> Yeah, that should work. You may also remove include/regex/cclass.h
> since it will not be used any more.
>
> > However, if the no-locale branches have significant performance
> > benefits then it might be worth pondering setting up some
> > optimizations.
>
> This is not the case.

Here is a patch based on this discussion.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Index: src/backend/regex/regcomp.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/regex/regcomp.c,v
retrieving revision 1.28
diff -c -r1.28 regcomp.c
*** src/backend/regex/regcomp.c    28 Oct 2001 06:25:49 -0000    1.28
--- src/backend/regex/regcomp.c    16 Apr 2002 23:12:38 -0000
***************
*** 47,53 ****
  #include "regex/regex.h"
  #include "regex/utils.h"
  #include "regex/regex2.h"
! #include "regex/cclass.h"
  #include "regex/cname.h"

  /*
--- 47,60 ----
  #include "regex/regex.h"
  #include "regex/utils.h"
  #include "regex/regex2.h"
! struct cclass
! {
!     char *name;
!     char *chars;
!     char *multis;
! };
! static struct cclass* cclasses = NULL;
! static struct cclass* cclass_init(void);
  #include "regex/cname.h"

  /*
***************
*** 174,179 ****
--- 181,189 ----
      pg_wchar   *wcp;
  #endif

+     if ( cclasses == NULL )
+         cclasses = cclass_init();
+
  #ifdef REDEBUG
  #define  GOODFLAGS(f)     (f)
  #else
***************
*** 884,890 ****
      struct cclass *cp;
      size_t        len;
      char       *u;
!     char        c;

      while (MORE() && pg_isalpha(PEEK()))
          NEXT();
--- 894,900 ----
      struct cclass *cp;
      size_t        len;
      char       *u;
!     unsigned char        c;

      while (MORE() && pg_isalpha(PEEK()))
          NEXT();
***************
*** 905,911 ****

      u = cp->chars;
      while ((c = *u++) != '\0')
!         CHadd(cs, c);
      for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)
          MCadd(p, cs, u);
  }
--- 915,921 ----

      u = cp->chars;
      while ((c = *u++) != '\0')
!         CHadd(cs, c);
      for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)
          MCadd(p, cs, u);
  }
***************
*** 1715,1718 ****
--- 1725,1788 ----
  #else
      return (islower((unsigned char) c));
  #endif
+ }
+
+ static struct cclass *
+ cclass_init(void)
+ {
+     struct cclass *cp = NULL;
+     struct cclass *classes = NULL;
+     struct cclass_factory
+     {
+         char *name;
+         int (*func)(int);
+         char *chars;
+     } cclass_factories [] =
+         {
+             { "alnum", isalnum, NULL },
+             { "alpha", isalpha, NULL },
+             { "blank", NULL, " \t" },
+             { "cntrl", iscntrl, NULL },
+             { "digit", NULL, "0123456789" },
+             { "graph", isgraph, NULL },
+             { "lower", islower, NULL },
+             { "print", isprint, NULL },
+             { "punct", ispunct, NULL },
+             { "space", NULL, "\t\n\v\f\r " },
+             { "upper", isupper, NULL },
+             { "xdigit", isxdigit, NULL },
+             { NULL, NULL, NULL }
+         };
+     struct cclass_factory *cf = NULL;
+
+     classes = malloc(sizeof(struct cclass) * (sizeof(cclass_factories) / sizeof(struct cclass_factory)));
+     if (classes == NULL)
+         elog(ERROR,"cclass_init: out of memory");
+
+     cp = classes;
+     for(cf = cclass_factories; cf->name != NULL; cf++)
+         {
+             cp->name = strdup(cf->name);
+             if ( cf->chars )
+                 cp->chars = strdup(cf->chars);
+             else
+                 {
+                     int x = 0, y = 0;
+                     cp->chars = malloc(sizeof(char) * 256);
+                     if (cp->chars == NULL)
+                         elog(ERROR,"cclass_init: out of memory");
+                     for (x = 0; x < 256; x++)
+                         {
+                             if((cf->func)(x))
+                                 *(cp->chars + y++) = x;
+                         }
+                     *(cp->chars + y) = '\0';
+                 }
+             cp->multis = "";
+             cp++;
+         }
+     cp->name = cp->chars = NULL;
+     cp->multis = "";
+
+     return classes;
  }

Re: regexp character class locale awareness patch

From
Alvaro Herrera
Date:
En Tue, 16 Apr 2002 19:21:50 -0400 (EDT)
Bruce Momjian <pgman@candle.pha.pa.us> escribió:

> Here is a patch based on this discussion.

I still think the xdigit class could be handled the same way the digit
class is (by enumeration rather than using the isxdigit function). That
saves you a cicle, and I don't think there's any loss.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"The ability to monopolize a planet is insignificant
next to the power of the source"


Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:
http://candle.pha.pa.us/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

---------------------------------------------------------------------------


Bruce Momjian wrote:
> Manuel Sugawara wrote:
> > Peter Eisentraut <peter_e@gmx.net> writes:
> > >
> > > Basically, you manually preprocess the patch to include the
> > > USE_LOCALE branch and remove the not USE_LOCALE branch.
> > 
> > Yeah, that should work. You may also remove include/regex/cclass.h
> > since it will not be used any more.
> > 
> > > However, if the no-locale branches have significant performance
> > > benefits then it might be worth pondering setting up some
> > > optimizations.
> > 
> > This is not the case.
> 
> Here is a patch based on this discussion.
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

> Index: src/backend/regex/regcomp.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/regex/regcomp.c,v
> retrieving revision 1.28
> diff -c -r1.28 regcomp.c
> *** src/backend/regex/regcomp.c    28 Oct 2001 06:25:49 -0000    1.28
> --- src/backend/regex/regcomp.c    16 Apr 2002 23:12:38 -0000
> ***************
> *** 47,53 ****
>   #include "regex/regex.h"
>   #include "regex/utils.h"
>   #include "regex/regex2.h"
> ! #include "regex/cclass.h"
>   #include "regex/cname.h"
>   
>   /*
> --- 47,60 ----
>   #include "regex/regex.h"
>   #include "regex/utils.h"
>   #include "regex/regex2.h"
> ! struct cclass
> ! {
> !     char *name;
> !     char *chars;
> !     char *multis;
> ! };
> ! static struct cclass* cclasses = NULL;
> ! static struct cclass* cclass_init(void);
>   #include "regex/cname.h"
>   
>   /*
> ***************
> *** 174,179 ****
> --- 181,189 ----
>       pg_wchar   *wcp;
>   #endif
>   
> +     if ( cclasses == NULL )
> +         cclasses = cclass_init();
> +     
>   #ifdef REDEBUG
>   #define  GOODFLAGS(f)     (f)
>   #else
> ***************
> *** 884,890 ****
>       struct cclass *cp;
>       size_t        len;
>       char       *u;
> !     char        c;
>   
>       while (MORE() && pg_isalpha(PEEK()))
>           NEXT();
> --- 894,900 ----
>       struct cclass *cp;
>       size_t        len;
>       char       *u;
> !     unsigned char        c;
>   
>       while (MORE() && pg_isalpha(PEEK()))
>           NEXT();
> ***************
> *** 905,911 ****
>   
>       u = cp->chars;
>       while ((c = *u++) != '\0')
> !         CHadd(cs, c);
>       for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)
>           MCadd(p, cs, u);
>   }
> --- 915,921 ----
>   
>       u = cp->chars;
>       while ((c = *u++) != '\0')
> !         CHadd(cs, c);   
>       for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)
>           MCadd(p, cs, u);
>   }
> ***************
> *** 1715,1718 ****
> --- 1725,1788 ----
>   #else
>       return (islower((unsigned char) c));
>   #endif
> + }
> + 
> + static struct cclass *
> + cclass_init(void)
> + {
> +     struct cclass *cp = NULL;
> +     struct cclass *classes = NULL;
> +     struct cclass_factory
> +     {
> +         char *name;
> +         int (*func)(int);
> +         char *chars;
> +     } cclass_factories [] =
> +         {
> +             { "alnum", isalnum, NULL },
> +             { "alpha", isalpha, NULL },
> +             { "blank", NULL, " \t" },
> +             { "cntrl", iscntrl, NULL },
> +             { "digit", NULL, "0123456789" },
> +             { "graph", isgraph, NULL },
> +             { "lower", islower, NULL },
> +             { "print", isprint, NULL },
> +             { "punct", ispunct, NULL },
> +             { "space", NULL, "\t\n\v\f\r " },
> +             { "upper", isupper, NULL },
> +             { "xdigit", isxdigit, NULL },
> +             { NULL, NULL, NULL }
> +         };
> +     struct cclass_factory *cf = NULL;
> + 
> +     classes = malloc(sizeof(struct cclass) * (sizeof(cclass_factories) / sizeof(struct cclass_factory)));
> +     if (classes == NULL)
> +         elog(ERROR,"cclass_init: out of memory");
> +     
> +     cp = classes;
> +     for(cf = cclass_factories; cf->name != NULL; cf++)
> +         {
> +             cp->name = strdup(cf->name);
> +             if ( cf->chars )
> +                 cp->chars = strdup(cf->chars);
> +             else
> +                 {
> +                     int x = 0, y = 0;
> +                     cp->chars = malloc(sizeof(char) * 256);
> +                     if (cp->chars == NULL)
> +                         elog(ERROR,"cclass_init: out of memory");
> +                     for (x = 0; x < 256; x++)
> +                         {
> +                             if((cf->func)(x))
> +                                 *(cp->chars + y++) = x;                            
> +                         }
> +                     *(cp->chars + y) = '\0';
> +                 }
> +             cp->multis = "";
> +             cp++;
> +         }
> +     cp->name = cp->chars = NULL;
> +     cp->multis = "";
> +     
> +     return classes;
>   }

> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
OK, once I apply the original patch, please submit a patch for this and
people can comment on it.  Thanks.


---------------------------------------------------------------------------

Alvaro Herrera wrote:
> En Tue, 16 Apr 2002 19:21:50 -0400 (EDT)
> Bruce Momjian <pgman@candle.pha.pa.us> escribi?:
> 
> > Here is a patch based on this discussion.
> 
> I still think the xdigit class could be handled the same way the digit
> class is (by enumeration rather than using the isxdigit function). That
> saves you a cicle, and I don't think there's any loss.
> 
> -- 
> Alvaro Herrera (<alvherre[a]atentus.com>)
> "The ability to monopolize a planet is insignificant
> next to the power of the source"
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> En Tue, 16 Apr 2002 19:21:50 -0400 (EDT)
> Bruce Momjian <pgman@candle.pha.pa.us> escribi?:
> 
> > Here is a patch based on this discussion.
> 
> I still think the xdigit class could be handled the same way the digit
> class is (by enumeration rather than using the isxdigit function). That
> saves you a cicle, and I don't think there's any loss.

In fact, I will email you when I apply the original patch.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: regexp character class locale awareness patch

From
Manuel Sugawara
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Alvaro Herrera wrote:
> > En Tue, 16 Apr 2002 19:21:50 -0400 (EDT)
> > Bruce Momjian <pgman@candle.pha.pa.us> escribi?:
> > 
> > > Here is a patch based on this discussion.
> > 
> > I still think the xdigit class could be handled the same way the digit
> > class is (by enumeration rather than using the isxdigit function). That
> > saves you a cicle, and I don't think there's any loss.
> 
> In fact, I will email you when I apply the original patch.

I miss that case :-(. Here is the pached patch.

Regards,
Manuel.

Index: src/backend/regex/regcomp.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/regex/regcomp.c,v
retrieving revision 1.28
diff -c -r1.28 regcomp.c
*** src/backend/regex/regcomp.c    28 Oct 2001 06:25:49 -0000    1.28
--- src/backend/regex/regcomp.c    16 Apr 2002 23:12:38 -0000
***************
*** 47,53 **** #include "regex/regex.h" #include "regex/utils.h" #include "regex/regex2.h"
! #include "regex/cclass.h" #include "regex/cname.h"  /*
--- 47,60 ---- #include "regex/regex.h" #include "regex/utils.h" #include "regex/regex2.h"
! struct cclass
! {
!     char *name;
!     char *chars;
!     char *multis;
! };
! static struct cclass* cclasses = NULL;
! static struct cclass* cclass_init(void); #include "regex/cname.h"  /*
***************
*** 174,179 ****
--- 181,189 ----     pg_wchar   *wcp; #endif 
+     if ( cclasses == NULL )
+         cclasses = cclass_init();
+      #ifdef REDEBUG #define  GOODFLAGS(f)     (f) #else
***************
*** 884,890 ****     struct cclass *cp;     size_t        len;     char       *u;
!     char        c;      while (MORE() && pg_isalpha(PEEK()))         NEXT();
--- 894,900 ----     struct cclass *cp;     size_t        len;     char       *u;
!     unsigned char        c;      while (MORE() && pg_isalpha(PEEK()))         NEXT();
***************
*** 905,911 ****      u = cp->chars;     while ((c = *u++) != '\0')
!         CHadd(cs, c);     for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)         MCadd(p, cs, u); }
--- 915,921 ----      u = cp->chars;     while ((c = *u++) != '\0')
!         CHadd(cs, c);        for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)         MCadd(p, cs, u); }
***************
*** 1715,1718 ****
--- 1725,1788 ---- #else     return (islower((unsigned char) c)); #endif
+ }
+ 
+ static struct cclass *
+ cclass_init(void)
+ {
+     struct cclass *cp = NULL;
+     struct cclass *classes = NULL;
+     struct cclass_factory
+     {
+         char *name;
+         int (*func)(int);
+         char *chars;
+     } cclass_factories [] =
+         {
+             { "alnum", isalnum, NULL },
+             { "alpha", isalpha, NULL },
+             { "blank", NULL, " \t" },
+             { "cntrl", iscntrl, NULL },
+             { "digit", NULL, "0123456789" },
+             { "graph", isgraph, NULL },
+             { "lower", islower, NULL },
+             { "print", isprint, NULL },
+             { "punct", ispunct, NULL },
+             { "space", NULL, "\t\n\v\f\r " },
+             { "upper", isupper, NULL },
+             { "xdigit",NULL, "abcdefABCDEF0123456789" },
+             { NULL, NULL, NULL }
+         };
+     struct cclass_factory *cf = NULL;
+ 
+     classes = malloc(sizeof(struct cclass) * (sizeof(cclass_factories) / sizeof(struct cclass_factory)));
+     if (classes == NULL)
+         elog(ERROR,"cclass_init: out of memory");
+     
+     cp = classes;
+     for(cf = cclass_factories; cf->name != NULL; cf++)
+         {
+             cp->name = strdup(cf->name);
+             if ( cf->chars )
+                 cp->chars = strdup(cf->chars);
+             else
+                 {
+                     int x = 0, y = 0;
+                     cp->chars = malloc(sizeof(char) * 256);
+                     if (cp->chars == NULL)
+                         elog(ERROR,"cclass_init: out of memory");
+                     for (x = 0; x < 256; x++)
+                         {
+                             if((cf->func)(x))
+                                 *(cp->chars + y++) = x;                            
+                         }
+                     *(cp->chars + y) = '\0';
+                 }
+             cp->multis = "";
+             cp++;
+         }
+     cp->name = cp->chars = NULL;
+     cp->multis = "";
+     
+     return classes; }


Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
OK, previous patch removed.

This patch has been added to the PostgreSQL unapplied patches list at:
http://candle.pha.pa.us/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.


---------------------------------------------------------------------------

Manuel Sugawara wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> 
> > Alvaro Herrera wrote:
> > > En Tue, 16 Apr 2002 19:21:50 -0400 (EDT)
> > > Bruce Momjian <pgman@candle.pha.pa.us> escribi?:
> > > 
> > > > Here is a patch based on this discussion.
> > > 
> > > I still think the xdigit class could be handled the same way the digit
> > > class is (by enumeration rather than using the isxdigit function). That
> > > saves you a cicle, and I don't think there's any loss.
> > 
> > In fact, I will email you when I apply the original patch.
> 
> I miss that case :-(. Here is the pached patch.
> 
> Regards,
> Manuel.
> 
> Index: src/backend/regex/regcomp.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/regex/regcomp.c,v
> retrieving revision 1.28
> diff -c -r1.28 regcomp.c
> *** src/backend/regex/regcomp.c    28 Oct 2001 06:25:49 -0000    1.28
> --- src/backend/regex/regcomp.c    16 Apr 2002 23:12:38 -0000
> ***************
> *** 47,53 ****
>   #include "regex/regex.h"
>   #include "regex/utils.h"
>   #include "regex/regex2.h"
> ! #include "regex/cclass.h"
>   #include "regex/cname.h"
>   
>   /*
> --- 47,60 ----
>   #include "regex/regex.h"
>   #include "regex/utils.h"
>   #include "regex/regex2.h"
> ! struct cclass
> ! {
> !     char *name;
> !     char *chars;
> !     char *multis;
> ! };
> ! static struct cclass* cclasses = NULL;
> ! static struct cclass* cclass_init(void);
>   #include "regex/cname.h"
>   
>   /*
> ***************
> *** 174,179 ****
> --- 181,189 ----
>       pg_wchar   *wcp;
>   #endif
>   
> +     if ( cclasses == NULL )
> +         cclasses = cclass_init();
> +     
>   #ifdef REDEBUG
>   #define  GOODFLAGS(f)     (f)
>   #else
> ***************
> *** 884,890 ****
>       struct cclass *cp;
>       size_t        len;
>       char       *u;
> !     char        c;
>   
>       while (MORE() && pg_isalpha(PEEK()))
>           NEXT();
> --- 894,900 ----
>       struct cclass *cp;
>       size_t        len;
>       char       *u;
> !     unsigned char        c;
>   
>       while (MORE() && pg_isalpha(PEEK()))
>           NEXT();
> ***************
> *** 905,911 ****
>   
>       u = cp->chars;
>       while ((c = *u++) != '\0')
> !         CHadd(cs, c);
>       for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)
>           MCadd(p, cs, u);
>   }
> --- 915,921 ----
>   
>       u = cp->chars;
>       while ((c = *u++) != '\0')
> !         CHadd(cs, c);   
>       for (u = cp->multis; *u != '\0'; u += strlen(u) + 1)
>           MCadd(p, cs, u);
>   }
> ***************
> *** 1715,1718 ****
> --- 1725,1788 ----
>   #else
>       return (islower((unsigned char) c));
>   #endif
> + }
> + 
> + static struct cclass *
> + cclass_init(void)
> + {
> +     struct cclass *cp = NULL;
> +     struct cclass *classes = NULL;
> +     struct cclass_factory
> +     {
> +         char *name;
> +         int (*func)(int);
> +         char *chars;
> +     } cclass_factories [] =
> +         {
> +             { "alnum", isalnum, NULL },
> +             { "alpha", isalpha, NULL },
> +             { "blank", NULL, " \t" },
> +             { "cntrl", iscntrl, NULL },
> +             { "digit", NULL, "0123456789" },
> +             { "graph", isgraph, NULL },
> +             { "lower", islower, NULL },
> +             { "print", isprint, NULL },
> +             { "punct", ispunct, NULL },
> +             { "space", NULL, "\t\n\v\f\r " },
> +             { "upper", isupper, NULL },
> +             { "xdigit",NULL, "abcdefABCDEF0123456789" },
> +             { NULL, NULL, NULL }
> +         };
> +     struct cclass_factory *cf = NULL;
> + 
> +     classes = malloc(sizeof(struct cclass) * (sizeof(cclass_factories) / sizeof(struct cclass_factory)));
> +     if (classes == NULL)
> +         elog(ERROR,"cclass_init: out of memory");
> +     
> +     cp = classes;
> +     for(cf = cclass_factories; cf->name != NULL; cf++)
> +         {
> +             cp->name = strdup(cf->name);
> +             if ( cf->chars )
> +                 cp->chars = strdup(cf->chars);
> +             else
> +                 {
> +                     int x = 0, y = 0;
> +                     cp->chars = malloc(sizeof(char) * 256);
> +                     if (cp->chars == NULL)
> +                         elog(ERROR,"cclass_init: out of memory");
> +                     for (x = 0; x < 256; x++)
> +                         {
> +                             if((cf->func)(x))
> +                                 *(cp->chars + y++) = x;                            
> +                         }
> +                     *(cp->chars + y) = '\0';
> +                 }
> +             cp->multis = "";
> +             cp++;
> +         }
> +     cp->name = cp->chars = NULL;
> +     cp->multis = "";
> +     
> +     return classes;
>   }
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: regexp character class locale awareness patch

From
Tatsuo Ishii
Date:
> I miss that case :-(. Here is the pached patch.
> 
> Regards,
> Manuel.

I also suggest that cclass_init() is called only if the locale is not
"C".
--
Tatsuo Ishii


Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
Tatsuo Ishii wrote:
> > I miss that case :-(. Here is the pached patch.
> > 
> > Regards,
> > Manuel.
> 
> I also suggest that cclass_init() is called only if the locale is not
> "C".

OK, patch on hold while this is addressed.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: regexp character class locale awareness patch

From
Manuel Sugawara
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Tatsuo Ishii wrote:
> > > I miss that case :-(. Here is the pached patch.
> > >
> > > Regards,
> > > Manuel.
> >
> > I also suggest that cclass_init() is called only if the locale is not
> > "C".
>
> OK, patch on hold while this is addressed.

Here is a patch which addresses Tatsuo's concerns (it does return an
static struct instead of constructing it).

Regards,
Manuel.


Attachment

Re: regexp character class locale awareness patch

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:
http://candle.pha.pa.us/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

---------------------------------------------------------------------------


Manuel Sugawara wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> 
> > Tatsuo Ishii wrote:
> > > > I miss that case :-(. Here is the pached patch.
> > > > 
> > > > Regards,
> > > > Manuel.
> > > 
> > > I also suggest that cclass_init() is called only if the locale is not
> > > "C".
> > 
> > OK, patch on hold while this is addressed.
> 
> Here is a patch which addresses Tatsuo's concerns (it does return an
> static struct instead of constructing it).
> 
> Regards,
> Manuel.
> 

[ Attachment, skipping... ]

> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: regexp character class locale awareness patch

From
Alvaro Herrera
Date:
En 17 Apr 2002 22:53:32 -0600
Manuel Sugawara <masm@fciencias.unam.mx> escribió:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> 
> > Tatsuo Ishii wrote:
> > > > I miss that case :-(. Here is the pached patch.
> > > > 
> > > > Regards,
> > > > Manuel.
> > > 
> > > I also suggest that cclass_init() is called only if the locale is not
> > > "C".
> > 
> > OK, patch on hold while this is addressed.
> 
> Here is a patch which addresses Tatsuo's concerns (it does return an
> static struct instead of constructing it).

Is there a reason to use "" instead of NULL in the "multis" member of
that static struct?

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"La virtud es el justo medio entre dos defectos" (Aristoteles)


Re: regexp character class locale awareness patch

From
Manuel Sugawara
Date:
Alvaro Herrera <alvherre@atentus.com> writes:

> En 17 Apr 2002 22:53:32 -0600
> Manuel Sugawara <masm@fciencias.unam.mx> escribió:
>
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >
> > > Tatsuo Ishii wrote:
> > > > > I miss that case :-(. Here is the pached patch.
> > > > >
> > > > > Regards,
> > > > > Manuel.
> > > >
> > > > I also suggest that cclass_init() is called only if the locale is not
> > > > "C".
> > >
> > > OK, patch on hold while this is addressed.
> >
> > Here is a patch which addresses Tatsuo's concerns (it does return an
> > static struct instead of constructing it).
>
> Is there a reason to use "" instead of NULL in the "multis" member of
> that static struct?

Yes, read the code.

Regards,
Manuel.