Thread: Re: Collation rules and multi-lingual databases

Re: Collation rules and multi-lingual databases

From

Greg Stark

Date:

22 August 2003, 19:16:26

So, I needed a way to sort using collation rules other than the one the
database was built with. So I wrote up the following function exposing strxfrm
with an extra parameter to specify the LC_COLLATE value to use.

This is my first C function so I'm really unsure that I've done the right
thing. For the most part I pattern-matched off the string_io code in the
contrib directory.

In particular I'm unsure about the code postgres-interfacing code in
c_varcharxfrm which makes an extra copy of both parameters that are passed in
and an extra copy of the result value. Are varchars guaranteed to be
nul-terminated? If so I can dispose of two of the copies. And I can probably
eliminate the copying of the result by alloting extra space when I allocate it
initially.

But more generally. Would it make more sense to use text or bytea or something
else to store these opaque binary strings? At least with glibc they tend to be
unreadable anyways.

Other caveats: It's condemned to be permanently non-threadsafe because the
whole locale system is a non thread-safe API. Also I fear some systems will
leak memory like a sieve when you call setlocale a few thousand times instead
of the 1 time at initialization that they foresaw. At least glibc doesn't seem
to leak in my brief testing.

If it's deemed a reasonable approach and nobody has any fatal flaws then I
expect it would be useful to put in the contrib directory?





--
greg

Attachment

Re: Collation rules and multi-lingual databases

From

Peter Eisentraut

Date:

22 August 2003, 19:53:02

Greg Stark writes:

> This is my first C function so I'm really unsure that I've done the right
> thing. For the most part I pattern-matched off the string_io code in the
> contrib directory.

That was just about the worst example you could have picked.  Please
forget everything you have seen and start by reading the documentation.
In particular, learn about the version 1 call convention and about the
PostgreSQL license.  And read some code under src/backend/utils/adt.

> In particular I'm unsure about the code postgres-interfacing code in
> c_varcharxfrm which makes an extra copy of both parameters that are passed in
> and an extra copy of the result value. Are varchars guaranteed to be
> nul-terminated?

They are guaranteed not to be null-terminated.

> But more generally. Would it make more sense to use text or bytea or something
> else to store these opaque binary strings? At least with glibc they tend to be
> unreadable anyways.

bytea

> If it's deemed a reasonable approach and nobody has any fatal flaws then I
> expect it would be useful to put in the contrib directory?

I'd expect it to be too slow to be useful.  Have you run performance tests?

-- 
Peter Eisentraut   peter_e@gmx.net

Re: Collation rules and multi-lingual databases

From

Stephan Szabo

Date:

22 August 2003, 20:08:11

On 22 Aug 2003, Greg Stark wrote:

>
> So, I needed a way to sort using collation rules other than the one the
> database was built with. So I wrote up the following function exposing strxfrm
> with an extra parameter to specify the LC_COLLATE value to use.
>
> This is my first C function so I'm really unsure that I've done the right
> thing. For the most part I pattern-matched off the string_io code in the
> contrib directory.
>
> In particular I'm unsure about the code postgres-interfacing code in
> c_varcharxfrm which makes an extra copy of both parameters that are passed in
> and an extra copy of the result value. Are varchars guaranteed to be
> nul-terminated? If so I can dispose of two of the copies. And I can probably
> eliminate the copying of the result by alloting extra space when I allocate it
> initially.
>
> But more generally. Would it make more sense to use text or bytea or something
> else to store these opaque binary strings? At least with glibc they tend to be
> unreadable anyways.
>
> Other caveats: It's condemned to be permanently non-threadsafe because the
> whole locale system is a non thread-safe API. Also I fear some systems will
> leak memory like a sieve when you call setlocale a few thousand times instead
> of the 1 time at initialization that they foresaw. At least glibc doesn't seem
> to leak in my brief testing.
>
> If it's deemed a reasonable approach and nobody has any fatal flaws then I
> expect it would be useful to put in the contrib directory?

I'm not sure that ERROR if the locale cannot be put back is sufficient
(although that case should be rare or non-existant). Unless something else
in the system resets the locale, after your transaction rolls back, you're
in a dangerous state.  I'd think FATAL would be better.

Re: Collation rules and multi-lingual databases

From

Tom Lane

Date:

22 August 2003, 20:53:55

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> On 22 Aug 2003, Greg Stark wrote:
>> If it's deemed a reasonable approach and nobody has any fatal flaws then I
>> expect it would be useful to put in the contrib directory?

> I'm not sure that ERROR if the locale cannot be put back is sufficient
> (although that case should be rare or non-existant).

A bigger risk is that something might elog(ERROR) while you have the
"wrong" locale set, denying you the chance to put back the right one.
I think this code is not nearly paranoid enough about how much it does
while the wrong locale is set.

> Unless something else
> in the system resets the locale, after your transaction rolls back, you're
> in a dangerous state.  I'd think FATAL would be better.

I'd go so far as to make it a critical section --- that ensures that any
ERROR will be turned to FATAL, even if it's in a subroutine you call.
        regards, tom lane

Re: Collation rules and multi-lingual databases

From

Stephan Szabo

Date:

22 August 2003, 22:29:16

On Fri, 22 Aug 2003, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> > On 22 Aug 2003, Greg Stark wrote:
> >> If it's deemed a reasonable approach and nobody has any fatal flaws then I
> >> expect it would be useful to put in the contrib directory?
>
> > I'm not sure that ERROR if the locale cannot be put back is sufficient
> > (although that case should be rare or non-existant).
>
> A bigger risk is that something might elog(ERROR) while you have the
> "wrong" locale set, denying you the chance to put back the right one.
> I think this code is not nearly paranoid enough about how much it does
> while the wrong locale is set.

True, there are calls to palloc, elog, etc inside there, although the elog
could be removed.

> > Unless something else
> > in the system resets the locale, after your transaction rolls back, you're
> > in a dangerous state.  I'd think FATAL would be better.
>
> I'd go so far as to make it a critical section --- that ensures that any
> ERROR will be turned to FATAL, even if it's in a subroutine you call.

I didn't know we could do that, could be handy, although the comments
imply that it turns into PANIC which would force a complete restart.  Then
again, it's better than a corrupted database.

Re: Collation rules and multi-lingual databases

From

Stephan Szabo

Date:

23 August 2003, 03:55:55

On Fri, 22 Aug 2003, Stephan Szabo wrote:

> On Fri, 22 Aug 2003, Tom Lane wrote:
>
> > Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> > > On 22 Aug 2003, Greg Stark wrote:
> > >> If it's deemed a reasonable approach and nobody has any fatal flaws then I
> > >> expect it would be useful to put in the contrib directory?
> >
> > > I'm not sure that ERROR if the locale cannot be put back is sufficient
> > > (although that case should be rare or non-existant).
> >
> > A bigger risk is that something might elog(ERROR) while you have the
> > "wrong" locale set, denying you the chance to put back the right one.
> > I think this code is not nearly paranoid enough about how much it does
> > while the wrong locale is set.
>
> True, there are calls to palloc, elog, etc inside there, although the elog
> could be removed.

Since most of that work is for an exceptional case, maybe it'd be safer
(although slower) to structure the function as

setlocale
call strxfrm (and that's it)
setlocale back
if there wasn't enough spacemake a new buffersetlocalecall strxfrm (and that's it)setlocale back

Probably putting the sl/strxfrm/sl into its own function.

Re: Collation rules and multi-lingual databases

From

Greg Stark

Date:

23 August 2003, 06:51:31

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> Since most of that work is for an exceptional case, maybe it'd be safer
> (although slower) to structure the function as

Yeah I thought of that. But if making it a critical section is cheap then it's
probably a better approach. The problem with restoring the locale for the
palloc is that if the user is unlucky he might sort a table of thousands of
strings that all trigger the exception case.

The glibc docs sample code suggests using 2x the original string length for
the initial buffer. My testing showed that *always* triggered the exceptional
case. A bit of experimentation lead to the 3x+4 which eliminates it except for
0 and 1 byte strings. I'm still tweaking it. But on another OS, or in a more
complex collation locale maybe you would still trigger it a lot. Even as it is
if you happy to try to sort a large list of single character strings you would
trigger it a lot.

I have some documentation reading to do apparently before I can fix this up.

> setlocale
> call strxfrm (and that's it)
> setlocale back
> if there wasn't enough space
>  make a new buffer
>  setlocale
>  call strxfrm (and that's it)
>  setlocale back
> 
> Probably putting the sl/strxfrm/sl into its own function.

-- 
greg

Re: Collation rules and multi-lingual databases

From

Stephan Szabo

Date:

23 August 2003, 12:41:22

On 23 Aug 2003, Greg Stark wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
>
> > Since most of that work is for an exceptional case, maybe it'd be safer
> > (although slower) to structure the function as
>
> Yeah I thought of that. But if making it a critical section is cheap then it's
> probably a better approach. The problem with restoring the locale for the
> palloc is that if the user is unlucky he might sort a table of thousands of
> strings that all trigger the exception case.

True. I still worry about the critical section since an error will cause
the entire database system to restart, but as I'd also forgotten about
signals which might cause problems (uncertain without looking) I'm not
sure you have a good alternative even not counting the speed issues.

Re: Collation rules and multi-lingual databases

From

Joe Conway

Date:

23 August 2003, 13:19:56

Greg Stark wrote:
> Yeah I thought of that. But if making it a critical section is cheap then it's
> probably a better approach. The problem with restoring the locale for the
> palloc is that if the user is unlucky he might sort a table of thousands of
> strings that all trigger the exception case.
> 

What about something like this?
8<--------------------------------

#include <setjmp.h>
#include <string.h>

#include "postgres.h"
#include "fmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"

#define GET_STR(textp) \  DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp)))
#define GET_BYTEA(str_) \  DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
#define MAX_BYTEA_LEN    0x3fffffff

/* * pg_strxfrm - Function to convert string similar to the strxfrm C * function using a specified locale. */
extern Datum pg_strxfrm(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(pg_strxfrm);

Datum
pg_strxfrm(PG_FUNCTION_ARGS)
{  char       *str = GET_STR(PG_GETARG_TEXT_P(0));  size_t      str_len = strlen(str);  char       *localestr =
GET_STR(PG_GETARG_TEXT_P(1)); size_t      approx_trans_len = 4 + (str_len * 3);  char       *trans = (char *)
palloc(approx_trans_len+ 1);  size_t      actual_trans_len;  char       *oldlocale;  char       *newlocale;  sigjmp_buf
save_restart;
 
  if (approx_trans_len > MAX_BYTEA_LEN)    elog(ERROR, "source string too long to transform");
  oldlocale = setlocale(LC_COLLATE, NULL);  if (!oldlocale)    elog(ERROR, "setlocale failed to return a locale");
  /* catch elog while locale is set other than the default */  memcpy(&save_restart, &Warn_restart,
sizeof(save_restart)); if (sigsetjmp(Warn_restart, 1) != 0)  {    memcpy(&Warn_restart, &save_restart,
sizeof(Warn_restart));   newlocale = setlocale(LC_COLLATE, oldlocale);    if (!newlocale)      elog(PANIC, "setlocale
failedto reset locale: %s", localestr);    siglongjmp(Warn_restart, 1);  }
 
  newlocale = setlocale(LC_COLLATE, localestr);  if (!newlocale)    elog(ERROR, "setlocale failed to set a locale: %s",
localestr);
  actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);
  /* if the buffer was not large enough, resize it and try again */  if (actual_trans_len >= approx_trans_len)  {
approx_trans_len= actual_trans_len + 1;    if (approx_trans_len > MAX_BYTEA_LEN)      elog(ERROR, "source string too
longto transform");
 
    trans = (char *) repalloc(trans, approx_trans_len + 1);    actual_trans_len = strxfrm(trans, str, approx_trans_len
+1);
 
    /* if the buffer still not large enough, punt */    if (actual_trans_len >= approx_trans_len)      elog(ERROR,
"strxfrmfailed, buffer insufficient");  }
 
  newlocale = setlocale(LC_COLLATE, oldlocale);  if (!newlocale)    elog(PANIC, "setlocale failed to reset locale: %s",
localestr);
  PG_RETURN_BYTEA_P(GET_BYTEA(trans));
}

8<--------------------------------

Joe

Re: Collation rules and multi-lingual databases

From

Joe Conway

Date:

23 August 2003, 13:56:31

Joe Conway wrote:
> What about something like this?

Oops! Forgot to restrore error handling. See below:

Joe

> 8<--------------------------------
> 
> #include <setjmp.h>
> #include <string.h>
> 
> #include "postgres.h"
> #include "fmgr.h"
> #include "tcop/tcopprot.h"
> #include "utils/builtins.h"
> 
> #define GET_STR(textp) \
>   DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp)))
> #define GET_BYTEA(str_) \
>   DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
> #define MAX_BYTEA_LEN    0x3fffffff
> 
> /*
>  * pg_strxfrm - Function to convert string similar to the strxfrm C
>  * function using a specified locale.
>  */
> extern Datum pg_strxfrm(PG_FUNCTION_ARGS);
> PG_FUNCTION_INFO_V1(pg_strxfrm);
> 
> Datum
> pg_strxfrm(PG_FUNCTION_ARGS)
> {
>   char       *str = GET_STR(PG_GETARG_TEXT_P(0));
>   size_t      str_len = strlen(str);
>   char       *localestr = GET_STR(PG_GETARG_TEXT_P(1));
>   size_t      approx_trans_len = 4 + (str_len * 3);
>   char       *trans = (char *) palloc(approx_trans_len + 1);
>   size_t      actual_trans_len;
>   char       *oldlocale;
>   char       *newlocale;
>   sigjmp_buf  save_restart;
> 
>   if (approx_trans_len > MAX_BYTEA_LEN)
>     elog(ERROR, "source string too long to transform");
> 
>   oldlocale = setlocale(LC_COLLATE, NULL);
>   if (!oldlocale)
>     elog(ERROR, "setlocale failed to return a locale");
> 
>   /* catch elog while locale is set other than the default */
>   memcpy(&save_restart, &Warn_restart, sizeof(save_restart));
>   if (sigsetjmp(Warn_restart, 1) != 0)
>   {
>     memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
>     newlocale = setlocale(LC_COLLATE, oldlocale);
>     if (!newlocale)
>       elog(PANIC, "setlocale failed to reset locale: %s", localestr);
>     siglongjmp(Warn_restart, 1);
>   }
> 
>   newlocale = setlocale(LC_COLLATE, localestr);
>   if (!newlocale)
>     elog(ERROR, "setlocale failed to set a locale: %s", localestr);
> 
>   actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);
> 
>   /* if the buffer was not large enough, resize it and try again */
>   if (actual_trans_len >= approx_trans_len)
>   {
>     approx_trans_len = actual_trans_len + 1;
>     if (approx_trans_len > MAX_BYTEA_LEN)
>       elog(ERROR, "source string too long to transform");
> 
>     trans = (char *) repalloc(trans, approx_trans_len + 1);
>     actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);
> 
>     /* if the buffer still not large enough, punt */
>     if (actual_trans_len >= approx_trans_len)
>       elog(ERROR, "strxfrm failed, buffer insufficient");
>   }
> 
>   newlocale = setlocale(LC_COLLATE, oldlocale);
>   if (!newlocale)
>     elog(PANIC, "setlocale failed to reset locale: %s", localestr);
  /* restore normal error handling */  memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));

> 
>   PG_RETURN_BYTEA_P(GET_BYTEA(trans));
> }
> 
> 8<--------------------------------
>

Re: Collation rules and multi-lingual databases

From

Greg Stark

Date:

23 August 2003, 16:34:58

Joe Conway <mail@joeconway.com> writes:

> >   if (sigsetjmp(Warn_restart, 1) != 0)
> >   {
> >     memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
> >     newlocale = setlocale(LC_COLLATE, oldlocale);
> >     if (!newlocale)
> >       elog(PANIC, "setlocale failed to reset locale: %s", localestr);
> >     siglongjmp(Warn_restart, 1);
> >   }

Well presumably we want FATAL not PANIC.

And do we still need HOLD_INTERRUPTS() .. RESUME_INTERRUPTS() ?

I was afraid that was getting into bed too much with the error handling. I
have an implementation that restores the locale around the palloc and
increases the initial guess for future calls to avoid degenerate behaviour.
I'm not sure which approach is preferable.

-- 
greg

Re: Collation rules and multi-lingual databases

From

Joe Conway

Date:

23 August 2003, 18:16:17

Greg Stark wrote:
> Joe Conway <mail@joeconway.com> writes:
> 
> 
>>>  if (sigsetjmp(Warn_restart, 1) != 0)
>>>  {
>>>    memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
>>>    newlocale = setlocale(LC_COLLATE, oldlocale);
>>>    if (!newlocale)
>>>      elog(PANIC, "setlocale failed to reset locale: %s", localestr);
>>>    siglongjmp(Warn_restart, 1);
>>>  }
> 
> 
> Well presumably we want FATAL not PANIC.

Yeah, that was a bit overzealous. I really intended FATAL.

> And do we still need HOLD_INTERRUPTS() .. RESUME_INTERRUPTS() ?

I'm not sure, but I think not.

> I was afraid that was getting into bed too much with the error handling. I
> have an implementation that restores the locale around the palloc and
> increases the initial guess for future calls to avoid degenerate behaviour.

Well the intention of the sigsetjmp was to avoid the need to flip the 
locale to-and-fro. Increasing the initial guess might be good, but it 
will further restrict the length of the input string you can work with. 
But I'd guess you'll not want to use this with extremely long strings 
anyway.

Joe

Re: Collation rules and multi-lingual databases

From

Tom Lane

Date:

24 August 2003, 17:19:53

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> On Fri, 22 Aug 2003, Tom Lane wrote:
>> I'd go so far as to make it a critical section --- that ensures that any
>> ERROR will be turned to FATAL, even if it's in a subroutine you call.

> I didn't know we could do that, could be handy, although the comments
> imply that it turns into PANIC which would force a complete restart.

Right, my imprecision, it actually goes to PANIC.

> Then again, it's better than a corrupted database.

Exactly.
        regards, tom lane

Re: Collation rules and multi-lingual databases

From

Tom Lane

Date:

24 August 2003, 17:32:31

Greg Stark <gsstark@mit.edu> writes:
> The glibc docs sample code suggests using 2x the original string
> length for the initial buffer. My testing showed that *always*
> triggered the exceptional case. A bit of experimentation lead to the
> 3x+4 which eliminates it except for 0 and 1 byte strings. I'm still
> tweaking it. But on another OS, or in a more complex collation locale
> maybe you would still trigger it a lot.

On HPUX it seems you always need 4x.  Also, *there are bugs* in some
platforms' implementations of strxfrm, such that an undersized buffer
may get overrun anyway.  I had originally tried to optimize the buffer
size like this in src/backend/utils/adt/selfuncs.c's use of strxfrm,
and eventually was forced to give it up as hopeless.  I strongly suggest
using the same code now seen there:
       char       *xfrmstr;       size_t        xfrmlen;       size_t        xfrmlen2;
       /*        * Note: originally we guessed at a suitable output buffer size,        * and only needed to call
strxfrmtwice if our guess was too        * small. However, it seems that some versions of Solaris have        * buggy
strxfrmthat can write past the specified buffer length        * in that scenario.  So, do it the dumb way for
portability.       *        * Yet other systems (e.g., glibc) sometimes return a smaller value        * from the second
callthan the first; thus the Assert must be <=        * not == as you'd expect.  Can't any of these people program
 * their way out of a paper bag?        */       xfrmlen = strxfrm(NULL, val, 0);       xfrmstr = (char *)
palloc(xfrmlen+ 1);       xfrmlen2 = strxfrm(xfrmstr, val, xfrmlen + 1);       Assert(xfrmlen2 <= xfrmlen);
 
        regards, tom lane