Thread: Implications of multi-byte support in a distribution

Implications of multi-byte support in a distribution

From
"Oliver Elphick"
Date:
I have had a request to add multi-byte support to the Debian binary
packages of PostgreSQL.

Since I live in England, I have personally no need of this and therefore
have little understanding of the implications.

If I change the packages to use multi-byte support, (UNICODE (UTF-8) is
suggested as the default), will there be any detrimental effects on the
fairly large parts of the world that don't need it?  Should I try to
provide two different packages, one with and one without MB support?

-- 
      Vote against SPAM: http://www.politik-digital.de/spam/
========================================
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
               PGP key from public servers; key ID 32B8FAA1
========================================
     "For what shall it profit a man, if he shall gain the whole
      world, and lose his own soul?"      Mark 8:36
 




Re: [HACKERS] Implications of multi-byte support in a distribution

From
Thomas Lockhart
Date:
> I have had a request to add multi-byte support to the Debian binary
> packages of PostgreSQL.
> Since I live in England, I have personally no need of this and therefore
> have little understanding of the implications.
> If I change the packages to use multi-byte support, (UNICODE (UTF-8) is
> suggested as the default), will there be any detrimental effects on the
> fairly large parts of the world that don't need it?  Should I try to
> provide two different packages, one with and one without MB support?

Probably. The downside to having MB support is reduced performance and
perhaps functionality. If you don't need it, don't build it...
                    - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Oleg Broytmann
Date:
On Mon, 30 Aug 1999, Oliver Elphick wrote:
> I have had a request to add multi-byte support to the Debian binary
> packages of PostgreSQL.
> 
> Since I live in England, I have personally no need of this and therefore
> have little understanding of the implications.
> 
> If I change the packages to use multi-byte support, (UNICODE (UTF-8) is
  I consider Unicode a compromise, and as such it is the worst case. I
don't know anyone who needs Unicode directly. Russian users need koi8 and
win1251; Chinese, Japanese and other folks need their appropriate
encodings (BIG5 and all that).
  I don't know what a reasonable default would be; in any case the
installation script should ask for the user's preference and run initdb -E
with the user's encoding to set the default.
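
To make the trade-off between national encodings and Unicode more concrete, here is a small Python sketch (an editorial illustration, not part of the original thread) comparing how many bytes the same text occupies in a native encoding and in UTF-8. The sample strings are arbitrary assumptions, and the codec names are Python's, not PostgreSQL's.

```python
# Byte length of the same text under different encodings.
# Sample strings are illustrative assumptions, not taken from the thread.
samples = {
    "koi8_r": "Привет",   # Russian, KOI8-R (single-byte)
    "cp1251": "Привет",   # Russian, Windows-1251 (single-byte)
    "big5": "中文",        # Traditional Chinese, Big5 (double-byte)
}

def encoded_sizes(text, native_codec):
    """Return (bytes in the native encoding, bytes in UTF-8)."""
    return len(text.encode(native_codec)), len(text.encode("utf-8"))

for codec, text in samples.items():
    native, utf8 = encoded_sizes(text, codec)
    print(f"{codec:8} native={native:2d} bytes  utf-8={utf8:2d} bytes")
```

For these samples the UTF-8 form is larger than the native one, which is one reason users of a single national encoding might prefer it as the default.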

> suggested as the default), will there be any detrimental effects on the
> fairly large parts of the world that don't need it?  Should I try to
> provide two different packages, one with and one without MB support?
  But of course. Many people do not want MB support in the distribution.
A suspicious sysadmin should reject such a package if (s)he does not
understand what/where/why MB is - and rightly so.
  Supporting two different packages is hard, but supporting only an
MB-enabled package will lead to many demands of "please provide a
smaller/better/faster PostgreSQL package".

Oleg.
----
    Oleg Broytmann     http://members.xoom.com/phd2/     phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.
 



Re: [HACKERS] Implications of multi-byte support in a distribution

From
Tatsuo Ishii
Date:
>> I have had a request to add multi-byte support to the Debian binary
>> packages of PostgreSQL.
>> Since I live in England, I have personally no need of this and therefore
>> have little understanding of the implications.
>> If I change the packages to use multi-byte support, (UNICODE (UTF-8) is
>> suggested as the default), will there be any detrimental effects on the
>> fairly large parts of the world that don't need it?  Should I try to
>> provide two different packages, one with and one without MB support?
>
>Probably. The downside to having MB support is reduced performance and
>perhaps functionality. If you don't need it, don't build it...

Not really. I did the regression test with/without multi-byte enabled.

with MB:    2:53.92 elapsed
w/o MB:     2:52.92 elapsed

Perhaps the worst case for MB would be regex ops. If you do a lot of
regex queries, the performance degradation might not be negligible.
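
One plausible reason regex is the worst case is that a multi-byte-aware matcher has to step through the input character by character rather than byte by byte. The Python sketch below is only an analogy (it uses Python's `re` module, not PostgreSQL's regex code) and shows how the same pattern behaves on raw UTF-8 bytes and on decoded characters.

```python
import re

text = "na\u00efve"            # "naive" with a diaeresis on the i: 5 characters
raw = text.encode("utf-8")     # 6 bytes, because that character takes two bytes

# Byte-oriented matching: "." consumes one byte, so it splits the
# two-byte character in half and returns an invalid UTF-8 fragment.
byte_match = re.match(rb"na.", raw).group()

# Character-oriented matching: "." consumes one whole character.
char_match = re.match(r"na.", text).group()

print(len(text), len(raw), byte_match, char_match)
```

The byte-oriented result is cheaper to compute but wrong for multi-byte text; getting the character-oriented result is the extra work an MB build has to do.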

Load module size:

with MB:    1208542
w/o MB:        1190925

(difference is 17KB)
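
For reference, the relative overheads implied by these figures can be worked out with a few lines of Python. This is an editorial sketch; it assumes the timings are minutes:seconds (2 min 53.92 s and 2 min 52.92 s) and that the sizes are in bytes.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds

# Figures from the message above (units are assumptions, see lead-in).
with_mb_time = to_seconds(2, 53.92)
without_mb_time = to_seconds(2, 52.92)
with_mb_size = 1208542
without_mb_size = 1190925

time_diff = with_mb_time - without_mb_time
time_overhead = time_diff / without_mb_time * 100
size_diff = with_mb_size - without_mb_size
size_overhead = size_diff / without_mb_size * 100

print(f"time: +{time_diff:.2f} s ({time_overhead:.2f}%)")
print(f"size: +{size_diff} bytes, {size_diff / 1024:.1f} KiB ({size_overhead:.2f}%)")
```

Under those assumptions the regression run is about 0.6% slower and the load module about 1.5% larger with MB enabled.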

As for functionality, I don't see any feature missing with
MB compared to w/o MB. (There are some features only MB has, for
example SET NAMES.)
--
Tatsuo Ishii


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Oleg Bartunov
Date:
On Tue, 31 Aug 1999, Tatsuo Ishii wrote:

> Date: Tue, 31 Aug 1999 18:29:21 +0900
> From: Tatsuo Ishii <t-ishii@sra.co.jp>
> To: Thomas Lockhart <lockhart@alumni.caltech.edu>
> Cc: Oliver Elphick <olly@lfix.co.uk>, hackers@postgresql.org,
>     43702@bugs.debian.org
> Subject: Re: [HACKERS] Implications of multi-byte support in a distribution 
> 
> >> I have had a request to add multi-byte support to the Debian binary
> >> packages of PostgreSQL.
> >> Since I live in England, I have personally no need of this and therefore
> >> have little understanding of the implications.
> >> If I change the packages to use multi-byte support, (UNICODE (UTF-8) is
> >> suggested as the default), will there be any detrimental effects on the
> >> fairly large parts of the world that don't need it?  Should I try to
> >> provide two different packages, one with and one without MB support?
> >
> >Probably. The downside to having MB support is reduced performance and
> >perhaps functionality. If you don't need it, don't build it...
> 
> Not really. I did the regression test with/without multi-byte enabled.
> 
> with MB:    2:53.92 elapsed
> w/o MB:     2:52.92 elapsed
> 
> Perhaps the worst case for MB would be regex ops. If you do a lot of
> regex queries, the performance degradation might not be negligible.

It should be. What would be nice is to have a column-specific
MB support. But I doubt if it's possible.

> 
> Load module size:
> 
> with MB:    1208542
> w/o MB:        1190925
> 
> (difference is 17KB)
> 
> As for functionality, I don't see any feature missing with
> MB compared to w/o MB. (There are some features only MB has, for
> example SET NAMES.)
> --
> Tatsuo Ishii

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] Implications of multi-byte support in a distribution

From
Tatsuo Ishii
Date:
>> Perhaps the worst case for MB would be regex ops. If you do a lot of
>> regex queries, the performance degradation might not be negligible.
>
>It should be. What would be nice is to have a column-specific
>MB support. But I doubt if it's possible.

That shouldn't be too difficult, if we have encoding information
with each text column or literal. Maybe now is the time to introduce
NCHAR?

BTW, it is interesting that people do not hesitate to enable the
with-locale option even if they only use ASCII. I guess the
performance degradation from enabling locale is not that small.
--
Tatsuo Ishii


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Thomas Lockhart
Date:
> That shouldn't be too difficult, if we have encoding information
> with each text column or literal. Maybe now is the time to introduce
> NCHAR?

I've been waiting for a go-ahead from folks who would use it. imho the
way to do it is to use Postgres' type system to implement it, rather
than, for example, encoding "type" information into each string. We
can also define a "default encoding" for each database as a new column
in pg_database...
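
The difference between the two designs can be sketched in Python. This is an editorial illustration only: the class names and the `ascii` default are hypothetical and say nothing about how Postgres' type system would actually be used.

```python
from dataclasses import dataclass
from typing import ClassVar

@dataclass(frozen=True)
class TaggedString:
    """Design A: every value carries its own encoding tag."""
    data: bytes
    encoding: str

@dataclass(frozen=True)
class Text:
    """Design B: the encoding belongs to the type, not to each value."""
    data: bytes
    encoding: ClassVar[str] = "ascii"   # hypothetical per-database default

    def decode(self) -> str:
        return self.data.decode(self.encoding)

@dataclass(frozen=True)
class KoiText(Text):
    """A hypothetical NCHAR-like type that overrides the default encoding."""
    encoding: ClassVar[str] = "koi8_r"

plain = Text(b"hello")
russian = KoiText("Привет".encode("koi8_r"))
print(plain.decode(), russian.decode())
```

In design B a value stores only its bytes; everything encoding-specific is looked up from the type, so plain text values pay no per-string overhead.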

> BTW, it is interesting that people do not hesitate to enable the
> with-locale option even if they only use ASCII. I guess the
> performance degradation from enabling locale is not that small.

Red Hat built their RPMs with locale enabled, and there is a
significant performance hit. Implementing NCHAR would be a better
solution, since the user can choose whether to use SQL_TEXT or the
locale-specific character set at run time...
                    - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Oleg Bartunov
Date:
On Wed, 1 Sep 1999, Thomas Lockhart wrote:

> Date: Wed, 01 Sep 1999 02:55:48 +0000
> From: Thomas Lockhart <lockhart@alumni.caltech.edu>
> To: t-ishii@sra.co.jp
> Cc: Oleg Bartunov <oleg@sai.msu.su>, Oliver Elphick <olly@lfix.co.uk>,
>     hackers@postgresql.org, 43702@bugs.debian.org
> Subject: Re: [HACKERS] Implications of multi-byte support in a distribution
> 
> > That shouldn't be too difficult, if we have encoding information
> > with each text column or literal. Maybe now is the time to introduce
> > NCHAR?

Yes, Postgres after 6.5, and especially after the recent win, is becoming
very popular, so avoiding an additional performance hit would be very
timely. Would implementing NCHAR alone solve all the problems with text,
varchar etc.?

> 
> I've been waiting for a go-ahead from folks who would use it. imho the
> way to do it is to use Postgres' type system to implement it, rather
> than, for example, encoding "type" information into each string. We
> can also define a "default encoding" for each database as a new column
> in pg_database...

go-ahead, Tom :-) I would use it.


> 
> > BTW, it is interesting that people do not hesitate to enable the
> > with-locale option even if they only use ASCII. I guess the
> > performance degradation from enabling locale is not that small.
> 
> Red Hat built their RPMs with locale enabled, and there is a
> significant performance hit. Implementing NCHAR would be a better
> solution, since the user can choose whether to use SQL_TEXT or the
> locale-specific character set at run time...
> 
>                      - Thomas
> 
> -- 
> Thomas Lockhart                lockhart@alumni.caltech.edu
> South Pasadena, California

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [HACKERS] Implications of multi-byte support in a distribution

From
Milan Zamazal
Date:
>>>>> "TL" == Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

    >> That shouldn't be too difficult, if we have encoding
    >> information with each text column or literal. Maybe now is the
    >> time to introduce NCHAR?

    TL> I've been waiting for a go-ahead from folks who would use
    TL> it. imho the way to do it is to use Postgres' type system to
    TL> implement it, rather than, for example, encoding "type"
    TL> information into each string. We can also define a "default
    TL> encoding" for each database as a new column in pg_database...
 

What about sorting?  Would it be possible to solve it in a similar way?
If I'm not mistaken, there is currently no good way to use two different
kinds of sorting within one postmaster instance.

Milan Zamazal


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Thomas Lockhart
Date:
>     >> That shouldn't be too difficult, if we have encoding
>     >> information with each text column or literal. Maybe now is the
>     >> time to introduce NCHAR?
>     TL> I've been waiting for a go-ahead from folks who would use
>     TL> it. imho the way to do it is to use Postgres' type system to
>     TL> implement it, rather than, for example, encoding "type"
>     TL> information into each string. We can also define a "default
>     TL> encoding" for each database as a new column in pg_database...
> What about sorting?  Would it be possible to solve it in a similar way?
> If I'm not mistaken, there is currently no good way to use two different
> kinds of sorting within one postmaster instance.

Each encoding/character set can behave however you want. You can reuse
collation and sorting code from another character set, or define a
unique one.
                   - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Hannu Krosing
Date:
Thomas Lockhart wrote:
> 
> >     >> That shouldn't be too difficult, if we have encoding
> >     >> information with each text column or literal. Maybe now is the
> >     >> time to introduce NCHAR?
> >     TL> I've been waiting for a go-ahead from folks who would use
> >     TL> it. imho the way to do it is to use Postgres' type system to
> >     TL> implement it, rather than, for example, encoding "type"
> >     TL> information into each string. We can also define a "default
> >     TL> encoding" for each database as a new column in pg_database...
> > What about sorting?  Would it be possible to solve it in a similar way?
> > If I'm not mistaken, there is currently no good way to use two different
> > kinds of sorting within one postmaster instance.
> 
> Each encoding/character set can behave however you want. You can reuse
> collation and sorting code from another character set, or define a
> unique one.

Is it really inside one postmaster instance?

If so, then is the character encoding defined at the create table /
create index process (maybe even separately for each field?) or can I
specify it when sort'ing?

-----------------
Hannu


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Thomas Lockhart
Date:
> > Each encoding/character set can behave however you want. You can reuse
> > collation and sorting code from another character set, or define a
> > unique one.
> Is it really inside one postmaster instance ?
> If so, then is the character encoding defined at the create table /
> create index process (maybe even separately for each field ?) or can I 
> specify it when sort'ing ?

Yes, yes, and yes ;)

I would propose that we implement the explicit collation features of
SQL92 using implicit type conversion. So if you want to use a
different sorting order on a *compatible* character set, then (looking
up in Date and Darwen for the syntax...):
 'test string' COLLATE CASE_INSENSITIVITY

becomes internally
 case_insensitivity('test string'::text)

and
 c1 < c2 COLLATE CASE_INSENSITIVITY

becomes
 case_insensitivity(c1) < case_insensitivity(c2)
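
The rewrite described here, where a COLLATE clause becomes a function applied to both operands, can be mimicked in Python. This is an editorial sketch: `case_insensitivity` is implemented with `str.casefold` purely as an assumption, and `less_than` is a hypothetical helper, not anything in Postgres.

```python
def case_insensitivity(value: str) -> str:
    """Hypothetical collation function: maps a string to its comparison key."""
    return value.casefold()

def less_than(c1: str, c2: str, collate=None) -> bool:
    """Evaluate `c1 < c2 [COLLATE name]` by wrapping both operands."""
    if collate is None:
        return c1 < c2                   # plain code-point comparison
    return collate(c1) < collate(c2)     # comparison on converted values

print(less_than("apple", "Banana"))                       # False: "B" sorts before "a"
print(less_than("apple", "Banana", case_insensitivity))   # True
print(sorted(["b", "A", "a", "B"], key=case_insensitivity))
```

The comparison operator itself never changes; only the values handed to it do, which is what makes the implicit-conversion approach attractive.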
                            - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Tatsuo Ishii
Date:
> > > Each encoding/character set can behave however you want. You can reuse
> > > collation and sorting code from another character set, or define a
> > > unique one.
> > Is it really inside one postmaster instance ?
> > If so, then is the character encoding defined at the create table /
> > create index process (maybe even separately for each field ?) or can I 
> > specify it when sort'ing ?
> 
> Yes, yes, and yes ;)

But we can't avoid calling strcoll() and some other code surrounded
by #ifdef LOCALE? I think what he actually wants is to define his own
collation *and* not to use locale if the column is ASCII only.
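
A minimal Python sketch of that wish, choosing per column whether to pay for strcoll(), might look as follows. It is an editorial illustration: `make_comparator` is a hypothetical helper, and the "C" locale is used only because it is the one locale guaranteed to exist, so both comparators happen to agree here.

```python
import functools
import locale

def make_comparator(locale_aware: bool):
    """Return a cmp-style function for one column."""
    if locale_aware:
        return locale.strcoll                     # goes through the C library
    return lambda a, b: (a > b) - (a < b)         # plain code-point comparison

# Assumption: the "C" locale, where strcoll orders by code point.
locale.setlocale(locale.LC_COLLATE, "C")

ascii_cmp = make_comparator(locale_aware=False)
locale_cmp = make_comparator(locale_aware=True)

words = ["pear", "Apple", "fig"]
ascii_sorted = sorted(words, key=functools.cmp_to_key(ascii_cmp))
locale_sorted = sorted(words, key=functools.cmp_to_key(locale_cmp))
print(ascii_sorted, locale_sorted)
```

With a real national locale installed and selected, the second ordering could differ from the first, while the ASCII-only column would keep using the cheap comparison.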

> I would propose that we implement the explicit collation features of
> SQL92 using implicit type conversion. So if you want to use a
> different sorting order on a *compatible* character set, then (looking
> up in Date and Darwen for the syntax...):
> 
>   'test string' COLLATE CASE_INSENSITIVITY
> 
> becomes internally
> 
>   case_insensitivity('test string'::text)
> 
> and
> 
>   c1 < c2 COLLATE CASE_INSENSITIVITY
> 
> becomes
> 
>   case_insensitivity(c1) < case_insensitivity(c2)

This idea seems great and elegant. OK, what about throwing away #ifdef
LOCALE? The same thing can be obtained by defining a special collation,
LOCALE_AWARE. This seems much more consistent to me.  Or even better,
we could explicitly have a predefined COLLATION for each language (these
can be automatically generated from existing locale data). This would
avoid some platform-specific locale problems.
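
Treating locale awareness as just one more named collation could be sketched like this in Python. This is an editorial illustration under stated assumptions: the registry, the `BINARY` name and the `order_by` helper are hypothetical, and `locale.strxfrm` stands in for whatever the C library's locale provides.

```python
import locale

# Hypothetical registry: collation name -> function producing a sort key.
COLLATIONS = {
    "BINARY": lambda s: s,                # plain code-point order, no locale
    "CASE_INSENSITIVITY": str.casefold,   # name taken from Thomas's example
    "LOCALE_AWARE": locale.strxfrm,       # delegates to the current locale
}

def order_by(values, collation="BINARY"):
    """Sort values under the named collation."""
    return sorted(values, key=COLLATIONS[collation])

values = ["b", "A", "a", "B"]
print(order_by(values))
print(order_by(values, "CASE_INSENSITIVITY"))
print(order_by(values, "LOCALE_AWARE"))   # result depends on the active locale
```

Per-language collations generated from locale data would simply be further entries in such a table, with no compile-time switch involved.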
---
Tatsuo Ishii


Re: [HACKERS] Implications of multi-byte support in a distribution

From
Thomas Lockhart
Date:
> But we can't avoid calling strcoll() and some other code surrounded
> by #ifdef LOCALE? I think what he actually wants is to define his own
> collation *and* not to use locale if the column is ASCII only.

Right. But there would be a fundamental character type which is *not*
locale-aware, and there is another type (perhaps/probably NCHAR?)
which is...

> OK, what about throwing away #ifdef
> LOCALE? The same thing can be obtained by defining a special collation,
> LOCALE_AWARE.

Or moving the locale-aware stuff to a formal NCHAR implementation.
istm (and to Date and Darwen ;) that there is a tighter relationship
between collations, character repertoires, and character sets than
might be inferred from the SQL92-defined capabilities.

> This seems much more consistent to me.  Or even better,
> we could explicitly have a predefined COLLATION for each language (these
> can be automatically generated from existing locale data). This would
> avoid some platform-specific locale problems.

Right. We may already have some of this with the "implicit type
coercion" conventions I introduced in the v6.4 release.
                      - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California