Thread: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Daniel Kalchev
Date:
Yes,

I do build with --enable-locale, but I do not build with --enable-mb and do 
not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the 
environment in both server and client, and this has worked for me in 6.4.2.

Regards,
Daniel

>>>"Hiroshi Inoue" said:> > > > -----Original Message-----> > From: owner-pgsql-hackers@postgreSQL.org> >
[mailto:owner-pgsql-hackers@postgreSQL.org]OnBehalf Of Daniel Kalchev> > Sent: Wednesday, June 09, 1999 5:04 PM> > To:
pgsql-hackers@hub.org>> Subject: [HACKERS] Postgres 6.5 beta2 and beta3 problem> > > > > > Hello,> > > > Sorry that I
pickedthis too late in the release cycle, but other work > > prevented my earlier involvement in the testing.> > > >
Thereare at least two serious problems that I discovered so far > > with Postgres > > 6.5 beta2 and beta3 (running on
BSD/OS4.0.1):> > > > 1. LIKE with indexes works worse than without indexes.> >> > Did you built with --enable-locale ?>
>Regards.> > Hiroshi Inoue> Inoue@tpf.co.jp > 
 




Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Bruce Momjian
Date:
> Yes,
> 
> I do build with --enable-locale, but I do not build with --enable-mb and do 
> not use client_encoding or server_encoding.
> 
> The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the 
> environment in both server and client, and this has worked for me in 6.4.2.


This certainly explains it.  With locale enabled, LIKE does not use
indexes because we can't figure out how to do the indexing trick with
non-ASCII character sets because we can't figure out the maximum
character value for a particular encoding.

We didn't do the check in 6.4.*, and LIKE was not returning the proper
results for queries at those sites that used locale.
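
For illustration, a minimal C sketch of the rewrite in question (the function
name here is made up; the real logic lives in the parser's makeIndexable()
code): the fixed prefix of the pattern becomes a range, and the upper bound is
the prefix with '\377' appended, which is only safe when '\377' really is the
highest-sorting character.

#include <stdio.h>
#include <string.h>

/*
 * Illustrative sketch only -- not the backend's code.  Turn the fixed
 * prefix of a LIKE pattern into index-usable range bounds:
 *
 *     fld LIKE 'abc%'  ==>  fld >= 'abc' AND fld <= 'abc\377'
 *
 * The upper bound assumes '\377' sorts last, which holds for plain
 * strcmp() ordering but not necessarily for a locale's strcoll()
 * ordering -- which is why the trick is skipped under USE_LOCALE.
 */
static void like_prefix_bounds(const char *prefix, char *lower, char *upper)
{
    size_t len = strlen(prefix);

    strcpy(lower, prefix);          /* lower bound: the prefix itself */
    strcpy(upper, prefix);
    upper[len] = '\377';            /* append the "largest" byte      */
    upper[len + 1] = '\0';
}

int main(void)
{
    char lo[64], hi[64];

    like_prefix_bounds("abc", lo, hi);
    /* hi is "abc" followed by the byte 0xFF */
    printf("lower='%s', upper ends in byte 0x%02x\n",
           lo, (unsigned char) hi[strlen(hi) - 1]);
    return 0;
}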

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Daniel Kalchev
Date:
Bruce,

This is extremely bad news, because it will make PostgreSQL 6.5 unusable for 
most of my applications. Perhaps something can be done to resolve this 
situation?

I understand the problem for 16-bit characters support, but for most of the 
encodings that support only 8 bit characters it should be safe to assume the 
maximum character value is 255.

Anyway, making this check compile-time defined would certainly fix things 
here, because in my case the cyrillic letters order match that of the binary 
encoding (that is, the first alphabet letter is before the second etc).

Perhaps the locale data can be used to gather this information?

I will do some testing without using locale to see what happens.

Regards,
Daniel

>>>Bruce Momjian said:
> > Yes,
> > 
> > I do build with --enable-locale, but I do not build with --enable-mb and do 
> > not use client_encoding or server_encoding.
> > 
> > The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the 
> > environment in both server and client, and this has worked for me in 6.4.2.
> 
> This certainly explains it.  With locale enabled, LIKE does not use
> indexes because we can't figure out how to do the indexing trick with
> non-ASCII character sets because we can't figure out the maximum
> character value for a particular encoding.
> 
> We didn't do the check in 6.4.*, and LIKE was not returning the proper
> results for queries at those sites that used locale.
> 
> -- 
>   Bruce Momjian                        |  http://www.op.net/~candle
>   maillist@candle.pha.pa.us            |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 




Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Oleg Bartunov
Date:
On Wed, 9 Jun 1999, Bruce Momjian wrote:

> Date: Wed, 9 Jun 1999 08:43:50 -0400 (EDT)
> From: Bruce Momjian <maillist@candle.pha.pa.us>
> To: Daniel Kalchev <daniel@digsys.bg>
> Cc: Hiroshi Inoue <Inoue@tpf.co.jp>, pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
> 
> > Yes,
> > 
> > I do build with --enable-locale, but I do not build with --enable-mb and do 
> > not use client_encoding or server_encoding.
> > 
> > The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the 
> > environment in both server and client, and this has worked for me in 6.4.2.
> 
> 
> This certainly explains it.  With locale enabled, LIKE does not use
> indexes because we can't figure out how to do the indexing trick with
> non-ASCII character sets because we can't figure out the maximum
> character value for a particular encoding.

If so, why does explain report 'Index Scan ...'?
apod=> \d
Database    = apod
+------------------+----------------------------------+----------+
|  Owner           |             Relation             |   Type   |
+------------------+----------------------------------+----------+
| megera           | idx_adate                        | index    |
| megera           | idx_atitle                       | index    |
| megera           | idx_url                          | index    |
| megera           | titles                           | table    |
+------------------+----------------------------------+----------+

apod=> explain select atitle from titles where atitle like 'Sun%';
NOTICE:  QUERY PLAN:

Index Scan using idx_atitle on titles  (cost=33.28 rows=1 width=12)

Current cvs, --enable-locale --with-mb=KOI8
Regards,
    Oleg
> 
> We didn't do the check in 6.4.*, and LIKE was not returning the proper
> results for queries at those sites that used locale.
> 
> -- 
>   Bruce Momjian                        |  http://www.op.net/~candle
>   maillist@candle.pha.pa.us            |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Oleg Bartunov
Date:
On Wed, 9 Jun 1999, Daniel Kalchev wrote:

> Date: Wed, 09 Jun 1999 16:15:58 +0300
> From: Daniel Kalchev <daniel@digsys.bg>
> To: Bruce Momjian <maillist@candle.pha.pa.us>
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem 
> 
> Bruce,
> 
> This is extremely bad news, because if will make PostgreSQL 6.5 unusable for 
> most of my applications. Perhaps something can be done to resolve this 
> situation?
> 
> I understand the problem for 16-bit characters support, but for most of the 
> encodings that support only 8 bit characters it should be safe to assume the 
> maximum character value is 255.
> 
> Anyway, making this check compile-time defined would certainly fix things 
> here, because in my case the cyrillic letters order match that of the binary 
> encoding (that is, the first alphabet letter is before the second etc).
> 
> Perhaps the locale data can be used to gather this information?

It's certainly there! The locale data contains all information about the
specific character set and encoding. Is it possible to use it to create
indices? It should be slower, but the benefit of using indices will cover
the expense for non-US people. I didn't notice such behaviour in Informix
and Oracle. Fixing this would be a good point in respect of the popularity
of Postgres. Is there any chance to place it in TODO for 6.6?
At least explain should reflect this fact!

Regards,
    Oleg

> 
> I will do some testing without using locale to see what happens.
> 
> Regards,
> Daniel
> 
> >>>Bruce Momjian said:
>  > > Yes,
>  > > 
>  > > I do build with --enable-locale, but I do not build with --enable-mb and do 
>  > > not use client_encoding or server_encoding.
>  > > 
>  > > The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the 
>  > > environment in both server and client, and this has worked for me in 6.4.2.
>  > 
>  > 
>  > This certainly explains it.  With locale enabled, LIKE does not use
>  > indexes because we can't figure out how to do the indexing trick with
>  > non-ASCII character sets because we can't figure out the maximum
>  > character value for a particular encoding.
>  > 
>  > We didn't do the check in 6.4.*, and LIKE was not returning the proper
>  > results for queries at those sites that used locale.
>  > 
>  > -- 
>  >   Bruce Momjian                        |  http://www.op.net/~candle
>  >   maillist@candle.pha.pa.us            |  (610) 853-3000
>  >   +  If your life is a hard drive,     |  830 Blythe Avenue
>  >   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> 
> 
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> This certainly explains it.  With locale enabled, LIKE does not use
> indexes because we can't figure out how to do the indexing trick with
> non-ASCII character sets because we can't figure out the maximum
> character value for a particular encoding.

We don't actually need the *maximum* character value, what we need is
to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:
        fld LIKE 'abc%' ==> fld <= 'abc\377'
is not even really right in ASCII locale, because it will reject a
data value like 'abc\377x'.

I think what we really want is to generate the "next value of the
same length" and use a < comparison.  In ASCII locale this meansfld LIKE 'abc%' ==> fld < 'abd'
which is reliable regardless of what comes after abc in the data.

The trick is to figure out a "next" value without assuming a lot
about the local character set and collation sequence.  I had
been thinking about a brute force method: generate a string and
check to see whether strcmp claims it's greater than the original
or not; if not, increment the last byte and try again.  You'd
also have to be able to back up and increment earlier bytes if
you maxed out the last one.  For example, in French locale,
        fld LIKE 'ab\376%'
you'd first produce 'ab\377' but discover that it's less than
'ab\376' (because \377 is y-dieresis which sorts like 'y').
Your next try must be 'ac\377' which will succeed.

But I am worried whether this trick will work in multibyte locales ---
incrementing the last byte might generate an invalid character sequence
and produce unpredictable results from strcmp.  So we need some help
from someone who knows a lot about collation orders and multibyte
character representations.
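
To make the brute-force idea concrete, here is a rough single-byte sketch (the
function name is mine, and multibyte encodings are deliberately not handled):
bump the last byte until strcoll() agrees the candidate sorts after the
original, backing up one position whenever a byte maxes out, exactly as in the
'ab\376' -> 'ac\377' walk-through above.

#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Sketch: find a string of the same length that strcoll() sorts after
 * "orig", by incrementing bytes from the end.  Returns false if none
 * exists (e.g. orig is all 0xFF bytes).  Single-byte locales only. */
static bool next_greater_string(const char *orig, char *next)
{
    size_t pos = strlen(orig);

    strcpy(next, orig);
    while (pos > 0)
    {
        unsigned char *p = (unsigned char *) &next[pos - 1];

        while (*p < 0xFF)
        {
            (*p)++;
            if (strcoll(next, orig) > 0)
                return true;        /* found a string sorting after orig */
        }
        pos--;                      /* byte maxed out: back up and retry */
    }
    return false;
}

int main(void)
{
    char buf[64];

    setlocale(LC_COLLATE, "");      /* collate per the environment's locale */
    if (next_greater_string("abc", buf))
        printf("fld LIKE 'abc%%' could use:  fld < '%s'\n", buf);
    return 0;
}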
        regards, tom lane


Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > This certainly explains it.  With locale enabled, LIKE does not use
> > indexes because we can't figure out how to do the indexing trick with
> > non-ASCII character sets because we can't figure out the maximum
> > character value for a particular encoding.
> 
> We don't actually need the *maximum* character value, what we need is
> to be able to generate a *slightly larger* character value.
> 
> For example, what the parser is doing now:
>         fld LIKE 'abc%' ==> fld <= 'abc\377'
> is not even really right in ASCII locale, because it will reject a
> data value like 'abc\377x'.
> 
> I think what we really want is to generate the "next value of the
> same length" and use a < comparison.  In ASCII locale this means
>         fld LIKE 'abc%' ==> fld < 'abd'
> which is reliable regardless of what comes after abc in the data.
> The trick is to figure out a "next" value without assuming a lot
> about the local character set and collation sequence.

in single-byte locales it should be easy:

1. sort a char[256] array from 0-255 using the current locale settings,
   do it once, either at startup or when first needed.

2. use binary search on that array to locate the last char before %
   in this sorted array:
   if (it is not the last char in sorted array)
   then (replace that char with the one at index+1)
   else (
     if (it is not the first char in like string)
     then (discard the last char and goto 2.)
     else (don't do the end restriction)
   )

some locales where the string is already sorted may use special 
treatment (ASCII, CYRILLIC) 
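
A rough sketch of that idea for single-byte locales (the names are
hypothetical, and the lookup below is a linear scan rather than the binary
search suggested above, purely for brevity):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static unsigned char sorted_bytes[255];   /* byte values 1..255 in locale order */

static int byte_coll_cmp(const void *a, const void *b)
{
    char s1[2] = { *(const char *) a, '\0' };
    char s2[2] = { *(const char *) b, '\0' };

    return strcoll(s1, s2);                /* compare as one-char strings */
}

/* Step 1: build the table once, at startup or on first use. */
static void build_sorted_bytes(void)
{
    for (int i = 0; i < 255; i++)
        sorted_bytes[i] = (unsigned char) (i + 1);
    qsort(sorted_bytes, 255, sizeof(unsigned char), byte_coll_cmp);
}

/* Step 2: the byte that collates immediately after c, or 0 if c sorts last. */
static unsigned char next_byte_in_locale(unsigned char c)
{
    for (int i = 0; i < 255; i++)
        if (sorted_bytes[i] == c)
            return (i + 1 < 255) ? sorted_bytes[i + 1] : 0;
    return 0;
}

int main(void)
{
    setlocale(LC_COLLATE, "");
    build_sorted_bytes();
    printf("after 'c' comes 0x%02x in this locale\n",
           next_byte_in_locale((unsigned char) 'c'));
    return 0;
}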

> But I am worried whether this trick will work in multibyte locales ---
> incrementing the last byte might generate an invalid character sequence
> and produce unpredictable results from strcmp.  So we need some help
> from someone who knows a lot about collation orders and multibyte
> character representations.

for double-byte locales something similar should work, but getting
the initial array is probably tricky

----------------
Hannu


Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Goran Thyni
Date:
It was me who found the bug and supplied the ugly fix, but that was the
only quick fix we could find until someone figures out how to get the
correct one from locale.
USE_LOCALE uses strcoll instead of strcmp for comparisons and indexes,
and it sorts correctly, but it does not work with the MakeIndexable
trick since \377 might not be the highest char in the alphabet, or
worse, might not be a char of the alphabet at all (as in the sv/fi
locales).

The patch gives a slower correct result instead of a fast incorrect one,
which is better IMHO, but maybe I am old-fashioned. :-)

Anyway, I think a correct solution should be:
"a LIKE 'abc%'" should generate:
"a >= 'abc\0' AND a < 'abd'"
                         ^
where d is the last char before % + 1, or more correctly, the next char
in the alphabet after the last char before %.
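
A toy version of that rule, under the assumption that byte order matches the
alphabet order (as Daniel reports for CP1251); in a general locale the "+1"
would have to follow the collation order instead, e.g. via a locale-ordered
byte table:

#include <string.h>

/* Toy sketch (hypothetical helper): assumes a non-empty prefix and that
 * byte order == alphabet order.  Real code would pick the next character
 * in collation order rather than just adding one to the byte value. */
static void like_prefix_to_range(const char *prefix, char *lower, char *upper)
{
    size_t len = strlen(prefix);

    strcpy(lower, prefix);          /* a >= 'abc'                          */
    strcpy(upper, prefix);
    upper[len - 1]++;               /* a <  'abd'  (naive next character)  */
}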

Still looking for a locale guru. Hello!! :-(

regards,
-- 
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT
reboot!


Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Bruce Momjian
Date:
> Bruce,
> 
> This is extremely bad news, because if will make PostgreSQL 6.5 unusable for 
> most of my applications. Perhaps something can be done to resolve this 
> situation?
> 
> I understand the problem for 16-bit characters support, but for most of the 
> encodings that support only 8 bit characters it should be safe to assume the 
> maximum character value is 255.
> 
> Anyway, making this check compile-time defined would certainly fix things 
> here, because in my case the cyrillic letters order match that of the binary 
> encoding (that is, the first alphabet letter is before the second etc).
> 
> Perhaps the locale data can be used to gather this information?
> 
> I will do some testing without using locale to see what happens.

The locale check is near the bottom of parser/gram.y, so you can
certainly enable indexing there.

I am told that french does not have 255 as its max character, and there
is no collating interface to request the highest character.  I suppose
one hack would be to go out and test all the char values to see which
is highest.
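
That hack could look roughly like this (a sketch under the single-byte
assumption; not actual backend code):

#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Probe all 255 non-zero byte values with strcoll() and return the one
 * the current locale sorts last.  Single-byte locales only. */
static unsigned char highest_collating_byte(void)
{
    char best[2] = { '\001', '\0' };
    char cand[2] = { '\0', '\0' };

    for (int c = 2; c <= 255; c++)
    {
        cand[0] = (char) c;
        if (strcoll(cand, best) > 0)
            best[0] = cand[0];
    }
    return (unsigned char) best[0];
}

int main(void)
{
    setlocale(LC_COLLATE, "");
    printf("highest-sorting byte in this locale: 0x%02x\n",
           highest_collating_byte());
    return 0;
}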
-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Bruce Momjian
Date:
> It's certainly there ! locale data contains all information about
> specific character set and encoding. Is it possible to use it to create
> indices ? It should be slower but benefits of using indices will cover
> expenses for non-US people. I didn't notice such behaivour in Informix
> and Oracle. Fixing this would be a good point in respect of popularity
> of Postgres. Are there any chance to place it in TODO for 6.6 ?
> At least explain should reflect such fact !
> 
> 

Added to TODO:
* Allow indexing of LIKE with locale character sets 

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Bruce Momjian
Date:
Yes. This pretty much sums up the problem.


> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > This certainly explains it.  With locale enabled, LIKE does not use
> > indexes because we can't figure out how to do the indexing trick with
> > non-ASCII character sets because we can't figure out the maximum
> > character value for a particular encoding.
> 
> We don't actually need the *maximum* character value, what we need is
> to be able to generate a *slightly larger* character value.
> 
> For example, what the parser is doing now:
>     fld LIKE 'abc%' ==> fld <= 'abc\377'
> is not even really right in ASCII locale, because it will reject a
> data value like 'abc\377x'.
> 
> I think what we really want is to generate the "next value of the
> same length" and use a < comparison.  In ASCII locale this means
>     fld LIKE 'abc%' ==> fld < 'abd'
> which is reliable regardless of what comes after abc in the data.
> 
> The trick is to figure out a "next" value without assuming a lot
> about the local character set and collation sequence.  I had
> been thinking about a brute force method: generate a string and
> check to see whether strcmp claims it's greater than the original
> or not; if not, increment the last byte and try again.  You'd
> also have to be able to back up and increment earlier bytes if
> you maxed out the last one.  For example, in French locale,
>     fld LIKE 'ab\376%'
> you'd first produce 'ab\377' but discover that it's less than
> 'ab\376' (because \377 is y-dieresis which sorts like 'y').
> Your next try must be 'ac\377' which will succeed.
> 
> But I am worried whether this trick will work in multibyte locales ---
> incrementing the last byte might generate an invalid character sequence
> and produce unpredictable results from strcmp.  So we need some help
> from someone who knows a lot about collation orders and multibyte
> character representations.
> 
>             regards, tom lane
> 


-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Daniel Kalchev
Date:
>>>Bruce Momjian said:
> The locale check is near the bottom of parser/gram.y, so you can
> certainly enable indexing there.

I commented the two ifdefs that had USE_LOCALE in gram.y and now like uses 
indexes for me...
> I am told that french does not have 255 as it's max character, and there
> there is no collating interface to request the highest character.  I
> suppose one hack would be to go out and test all the char values to see
> which is highest.
 

As I understand the way the LIKE indexing currently works without locale, it 
should create this problem for the Bulgarian cyrillic (cp1251) too, because 
character code 255 is a valid cyrillic character (the last one in the 
alphabet). Therefore, the solution proposed by Hannu Krosing should be 
implemented.

I believe we can make the assumption, that if you --enable-locale, but do not 
use --with-mb, then you are using single-byte locales and therefore the hack 
might work properly. If you use --with-mb you are out of luck until someone 
explains better how multibyte characters are ordered.

Is there any other place than gram.y where this hack needs to be implemented?

While I am bashing the locale support... why does the ~* operator not work 
with locales? That is, I need to use a construct like

SELECT key from t WHERE upper(a) ~ upper('somestring');

instead of

SELECT key FROM t WHERE a ~* 'somestring';

Or store everything in the database in uppercase (works for keys) and 
uppercase the string in the frontend. The result is that this construct does 
not use indices. We also cannot create indexes on upper(a). I believe this 
was an outstanding problem in pre-6.4.

I found this latter problem to be resolved by modifying the 
backend/regex/makefile to add -funsigned-char to CFLAGS. This is under BSD/OS 
4.0.1 - I found out that by default, characters that are 'alpha' in cyrillic 
are treated by the compiler as negative, and therefore isalpha() returns 
zero... I believe this should be fixed, as it may be causing other problems 
with non-ASCII locales.

My proposal is to add -funsigned-char to all places where routines such as 
'isalpha' are used, and ifdef it for USE_LOCALE.
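
The pitfall is easy to demonstrate; in the sketch below the byte value is just
an example of a CP1251 letter, and the fix is either the unsigned char cast or
building with -funsigned-char:

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    char c = (char) 0xE0;        /* e.g. a CP1251 cyrillic letter */

    setlocale(LC_CTYPE, "");     /* pick up LC_CTYPE=CP1251 from the env */

    /* isalpha(c) is broken when char is signed: c promotes to a negative
     * int, which is undefined for isalpha() and typically yields 0. */
    printf("%d\n", isalpha((unsigned char) c));   /* safe: value is 0..255 */
    return 0;
}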

Daniel



Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Tatsuo Ishii
Date:
> I believe we can make the assumption, that if you --enable-locale, but do not 
> use --with-mb, then you are using single-byte locales and therefore the hack 
> might work properly. 

I believe at least one person uses both --enable-locale and --with-mb
for single byte locale.

> If you use --with-mb you are out of luck until someone 
> explains better how multibyte characters are ordered.

I think until NATIONAL CHARACTER is fully implemented, we would not be
able to properly handle sort orders with multi-byte characters.
(Thomas, I've been thinking about implementing NCHAR too!)

So you could do whatever you like as long as they are in #ifndef
MULTIBYTE:-)

> I found this later problem to be resolved by modifying the 
> backend/regex/makefile to add -funsigned-char to CFLAGS. This is under BSD/OS 
> 4.0.1 - I found out, that by default characters that are 'alpha' in cyrillic 
> are threated by the compiler as negative and therefore isalpha() returns 
> zero... I believe this should be fixed as it may be causing other problems 
> with non-ASCII locales.
> 
> My proposal is to add -funsigned-char to all places where routines such as 
> 'isalpha' are used, and ifdef it for USE_LOCALE.

Once I proposed another solution to the BSD/OS guy (I don't remember who
he was) and asked him to check if it worked. Unfortunately I got no
answer from him. Then I merged it into the #ifdef MULTIBYTE sections in
regcomp.c, and it seems to work at least for the cyrillic locale (Thanks
Oleg for testing!).
---
Tatsuo Ishii



locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

From
Goran Thyni
Date:
Tatsuo Ishii wrote:
> I think until NATIONAL CHARACTER is fully implemented, we would not be
> able to properly handle sort orders with multi-byte characters.
> (Thomas, I've been think about implementing NCHAR too!)
> 
> So you could do whatever you like as long as they are in #ifndef
> MULTIBYTE:-)

Hmm,
and the world is moving towards Unicode.
We definitely need working support for locales and multibyte chars.
Perhaps the right way to do it is to store everything in
some Unicode format internally and convert the output
according to the suggested "per-column-locale-definition".

Larry Wall has hacked UTF8 support into perl, so it is doable,
so let's see what we can do for 6.6 (time permitting as always).

regards,
Göran


Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From
Oleg Broytmann
Date:
Hi!

On Thu, 10 Jun 1999, Tatsuo Ishii wrote:
> > I believe we can make the assumption, that if you --enable-locale, but do not 
> > use --with-mb, then you are using single-byte locales and therefore the hack 
> > might work properly. 
> 
> I believe at least one person uses both --enable-locale and --with-mb
> for single byte locale.
  I think it's me. But very soon everyone in Russia will use locale+mb for the
koi8 and win1251 locales - all of us constantly need to convert between them.
Making Postgres convert on-the-fly is a big win.

> Once I propsed another solution to BSD/OS guy (I don't remember who he
> was) and asked him to check if it worked. Unfortunately I got no
> answer from him. Then I merged it into #ifdef MULTIBYTE sections in
  I remember he raised the issue a few times, and finally we agreed on
-funsigned-char. I think he got a working Postgres and locale after all.

> regcomp.c and it seems work at least for cyrillic locale (Thanks Oleg
> for testing!).
  Works now pretty good!

> ---
> Tatsuo Ishii

Oleg.
---- 
    Oleg Broytmann     http://members.xoom.com/phd2/     phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.
 



Re: locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

From
Tatsuo Ishii
Date:
>and the world is moving towards Unicode.
>we definitely need working support for locales and multichar.
>Perhaps the right way to do it is to store everything in
>som Unicode format internally and convert the output
>according to the suggested "per-column-locale-definition".

No. There's nothing perfect in the world, and Unicode is no
exception. So we need to keep the freedom of choice of the
internal encodings. Currently the mb support allows several internal
encodings, including Unicode and mule-internal-code.
(Yes, you can do regexp/LIKE on Unicode data if mb support is
enabled.)
--
Tatsuo Ishii


RE: [HACKERS] locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

From
Dmitry Samersoff
Date:
On 11-Jun-99 Goran Thyni wrote:
> Tatsuo Ishii wrote:
>> I think until NATIONAL CHARACTER is fully implemented, we would not be
>> able to properly handle sort orders with multi-byte characters.
>> (Thomas, I've been think about implementing NCHAR too!)
>> 
>> So you could do whatever you like as long as they are in #ifndef
>> MULTIBYTE:-)
> 
> Hmm,
> and the world is moving towards Unicode.
> we definitely need working support for locales and multichar.
> Perhaps the right way to do it is to store everything in
> som Unicode format internally and convert the output
> according to the suggested "per-column-locale-definition".
> 
> Larry Wall has hacked UTF8 support into perl so it is doable,
> so let see what we can do for 6.6 (time permitting as always).

IMHO, it would also be pleasant to be able to set the charset/unicode per
database, or possibly per single table, at runtime, rather than for the whole
postgres installation at configure time.


---
Dmitry Samersoff, dms@wplus.net, ICQ:3161705
http://devnull.wplus.net
* There will come soft rains ...


Re: [HACKERS] locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Currently the mb support allows serveral internal
> encodings including Unicode and mule-internal-code.
> (yes, you can do regexp/like to Unicode data if mb support is
> enabled).

One of the things that bothers me about makeIndexable() is that it
doesn't seem to be multibyte-aware; does it really work in MB case?
        regards, tom lane


Re: [HACKERS] locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > Currently the mb support allows serveral internal
> > encodings including Unicode and mule-internal-code.
> > (yes, you can do regexp/like to Unicode data if mb support is
> > enabled).
> 
> One of the things that bothers me about makeIndexable() is that it
> doesn't seem to be multibyte-aware; does it really work in MB case?

Yes. This is because I carefully chose multibyte encodings for
the backend that have the following characteristics:

o if the 8th bit of a byte is off, then it is an ascii character
o otherwise, it is part of a non-ascii multibyte character

With these assumptions, makeIndexable() works very well with multibyte
chars.

Not all multibyte encodings satisfy the above conditions. For example,
SJIS (an encoding for Japanese) and Big5 (for traditional Chinese)
do not satisfy those requirements. In these encodings the first byte
of a double-byte character always has the 8th bit on. However, in the
second byte the 8th bit is sometimes off: this means we cannot
distinguish it from ascii, since it may accidentally match the bit
pattern of an ascii char. This is why I do not allow SJIS and Big5 as
server encodings.  Users can use SJIS and Big5 for the client encoding,
however.
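
In other words, the invariant the byte-oriented backend code relies on boils
down to a one-line test (illustrative helper, not backend code), and SJIS/Big5
are excluded precisely because their second bytes can land in the ASCII range:

#include <stdbool.h>

/* In the encodings allowed on the server side, a byte with the high bit
 * clear is always a plain ascii character, and a byte with the high bit
 * set always belongs to a multibyte character.  Byte-wise scans (for '%',
 * '_', quotes, ...) therefore never misread part of a multibyte char. */
static bool is_ascii_byte(unsigned char b)
{
    return (b & 0x80) == 0;
}

/* SJIS and Big5 break this rule: the first byte of a double-byte char has
 * the high bit set, but the second byte may fall in the 0x40-0x7E range,
 * so a byte-wise scan could mistake it for an ascii character.  Hence they
 * are allowed only as client encodings. */
int main(void)
{
    return (is_ascii_byte('%') && !is_ascii_byte(0xA4)) ? 0 : 1;
}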

You might ask why I don't make makeIndexable() multibyte-aware.  It is
definitely possible. But you should know there are many places that
would need to be multibyte-aware in this sense. The parser is one good
example. Making everything in the backend multibyte-aware is not worth
doing, in my opinion.
---
Tatsuo Ishii


Re: [HACKERS] locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Yes. This is because I carefully choose multibyte encodings for
> the backend that have following characteristics:
> o if the 8th bit of a byte is off then it is a ascii character
> o otherwise it is part of non ascii multibyte characters

Ah so.

> You might ask why I don't make makeIndexable() multibyte-aware.  It
> definitely possible. But you should know there are many places that
> need to be multibyte-aware in this sence. The parser is one of the
> good example.

Right, it's much easier to dodge the problem by restricting backend
encodings, and since we have encoding conversions, that doesn't hurt anyone.
Now that I think about it, all the explicitly MB-aware code that
I've seen is in frontend stuff.

Thanks for the clue...
        regards, tom lane