Thread: Implications of multi-byte support in a distribution
I have had a request to add multi-byte support to the Debian binary
packages of PostgreSQL.

Since I live in England, I have personally no need of this and therefore
have little understanding of the implications.

If I change the packages to use multi-byte support (UNICODE (UTF-8) is
suggested as the default), will there be any detrimental effects on the
fairly large parts of the world that don't need it? Should I try to
provide two different packages, one with and one without MB support?

--
Vote against SPAM: http://www.politik-digital.de/spam/
========================================
Oliver Elphick                          Oliver.Elphick@lfix.co.uk
Isle of Wight                           http://www.lfix.co.uk/oliver
PGP key from public servers; key ID 32B8FAA1
========================================
    "For what shall it profit a man, if he shall gain the whole
     world, and lose his own soul?"           Mark 8:36
> I have had a request to add multi-byte support to the Debian binary
> packages of PostgreSQL.
> Since I live in England, I have personally no need of this and therefore
> have little understanding of the implications.
> If I change the packages to use multi-byte support (UNICODE (UTF-8) is
> suggested as the default), will there be any detrimental effects on the
> fairly large parts of the world that don't need it? Should I try to
> provide two different packages, one with and one without MB support?

Probably. The downside to having MB support is reduced performance and
perhaps functionality. If you don't need it, don't build it...

                      - Thomas

--
Thomas Lockhart    lockhart@alumni.caltech.edu    South Pasadena, California
On Mon, 30 Aug 1999, Oliver Elphick wrote:

> I have had a request to add multi-byte support to the Debian binary
> packages of PostgreSQL.
>
> Since I live in England, I have personally no need of this and therefore
> have little understanding of the implications.
>
> If I change the packages to use multi-byte support (UNICODE (UTF-8) is

I consider Unicode a compromise, and as such, the worst case. I don't
know anyone who needs Unicode directly. Russian users need koi8 and
win1251; Chinese, Japanese and other folks need their appropriate
encodings (BIG5 and all that). I don't know what a reasonable default
would be; in any case the installation script should ask for the user's
preference and run initdb -E with that encoding to set the default.

> suggested as the default), will there be any detrimental effects on the
> fairly large parts of the world that don't need it? Should I try to
> provide two different packages, one with and one without MB support?

But of course. Many people do not want MB support in the distribution.
A suspicious sysadmin should reject such a package if (s)he does not
understand the what/where/why of MB - and rightly so. Supporting two
different packages is hard, but supporting only an MB-enabled package
will lead to many demands of "please provide a smaller/better/faster
PostgreSQL package".

Oleg.
----
Oleg Broytmann    http://members.xoom.com/phd2/    phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
>> I have had a request to add multi-byte support to the Debian binary
>> packages of PostgreSQL.
>> Since I live in England, I have personally no need of this and therefore
>> have little understanding of the implications.
>> If I change the packages to use multi-byte support (UNICODE (UTF-8) is
>> suggested as the default), will there be any detrimental effects on the
>> fairly large parts of the world that don't need it? Should I try to
>> provide two different packages, one with and one without MB support?
>
> Probably. The downside to having MB support is reduced performance and
> perhaps functionality. If you don't need it, don't build it...

Not really. I ran the regression test with and without multi-byte
enabled:

with MB: 2:53.92 elapsed
w/o MB:  2:52.92 elapsed

Perhaps the worst case for MB would be regex ops. If you do a lot of
regex queries, the performance degradation might not be negligible.

Load module size:

with MB: 1208542
w/o MB:  1190925

(a difference of 17KB)

As for functionality, I don't see any feature missing with MB compared
to without it. (There are some features only MB has - for example,
SET NAMES.)
--
Tatsuo Ishii
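[Editorial aside: the regex cost mentioned above comes from an MB-aware
matcher having to advance by character boundaries rather than single
bytes. A minimal Python sketch of that per-character boundary work,
purely illustrative - PostgreSQL's actual MB regex support is C code:]

```python
def utf8_char_lengths(data: bytes):
    """Yield the byte length of each UTF-8 character in `data`.

    Single-byte code can treat every byte as one character; an
    MB-aware regex engine must do this leading-byte decoding for
    every character it advances over.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n = 1              # ASCII: single byte
        elif b >> 5 == 0b110:
            n = 2              # 2-byte sequence (leading 110xxxxx)
        elif b >> 4 == 0b1110:
            n = 3              # 3-byte sequence (leading 1110xxxx)
        else:
            n = 4              # 4-byte sequence
        yield n
        i += n

print(list(utf8_char_lengths("hello".encode("utf-8"))))  # [1, 1, 1, 1, 1]
print(list(utf8_char_lengths("héllo".encode("utf-8"))))  # [1, 2, 1, 1, 1]
```

For pure ASCII the lengths are all 1, which is why the overall
regression-test timings above barely differ.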
On Tue, 31 Aug 1999, Tatsuo Ishii wrote:

> >> If I change the packages to use multi-byte support (UNICODE (UTF-8) is
> >> suggested as the default), will there be any detrimental effects on the
> >> fairly large parts of the world that don't need it? Should I try to
> >> provide two different packages, one with and one without MB support?
> >
> > Probably. The downside to having MB support is reduced performance and
> > perhaps functionality. If you don't need it, don't build it...
>
> Not really. I ran the regression test with and without multi-byte
> enabled:
>
> with MB: 2:53.92 elapsed
> w/o MB:  2:52.92 elapsed
>
> Perhaps the worst case for MB would be regex ops. If you do a lot of
> regex queries, the performance degradation might not be negligible.

It should be. What would be nice is to have column-specific MB
support. But I doubt that it's possible.

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
>> Perhaps the worst case for MB would be regex ops. If you do a lot of
>> regex queries, the performance degradation might not be negligible.
>
> It should be. What would be nice is to have column-specific MB
> support. But I doubt that it's possible.

That shouldn't be too difficult, if we had encoding information
attached to each text column or literal. Maybe now is the time to
introduce NCHAR?

BTW, it is interesting that people do not hesitate to enable the
with-locale option even if they only use ASCII. I guess the
performance degradation from enabling locale is not that small.
--
Tatsuo Ishii
> That shouldn't be too difficult, if we had encoding information
> attached to each text column or literal. Maybe now is the time to
> introduce NCHAR?

I've been waiting for a go-ahead from folks who would use it. imho the
way to do it is to use Postgres' type system to implement it, rather
than, for example, encoding "type" information into each string. We
can also define a "default encoding" for each database as a new column
in pg_database...

> BTW, it is interesting that people do not hesitate to enable the
> with-locale option even if they only use ASCII. I guess the
> performance degradation from enabling locale is not that small.

Red Hat built their RPMs with locale enabled, and there is a
significant performance hit. Implementing NCHAR would be a better
solution, since the user could choose between SQL_TEXT and the
locale-specific character set at run time...

                      - Thomas

--
Thomas Lockhart    lockhart@alumni.caltech.edu    South Pasadena, California
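[Editorial aside: the locale hit comes from routing every string
comparison through the C library's collation machinery instead of a
plain byte comparison. A rough Python illustration - here
locale.strxfrm stands in for the strcoll()-style transform a
--with-locale build pays for on every comparison, even for pure-ASCII
data; the C locale is used so the example runs anywhere:]

```python
import locale

words = ["banana", "Apple", "cherry"]

# Byte-wise sort: cheap, memcmp-style comparison, what a
# non-locale build does.
plain = sorted(words)

# Locale-aware sort: every key first goes through strxfrm (the
# transform underlying strcoll), which is the extra per-comparison
# work a locale-enabled build performs.
locale.setlocale(locale.LC_COLLATE, "C")
aware = sorted(words, key=locale.strxfrm)

print(plain)  # ['Apple', 'banana', 'cherry']
print(aware)  # same order under the C locale, via the slower path
```

Under the C locale both paths agree, so ASCII-only users pay the
transform cost for nothing - which is the point being made above.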
On Wed, 1 Sep 1999, Thomas Lockhart wrote:

> > That shouldn't be too difficult, if we had encoding information
> > attached to each text column or literal. Maybe now is the time to
> > introduce NCHAR?

Yes - Postgres after 6.5 has become very popular, especially with its
recent wins, so avoiding an additional performance hit would be very
timely. Would implementing NCHAR alone solve all the problems with
text, varchar, etc.?

> I've been waiting for a go-ahead from folks who would use it. imho the
> way to do it is to use Postgres' type system to implement it, rather
> than, for example, encoding "type" information into each string. We
> can also define a "default encoding" for each database as a new column
> in pg_database...

Go ahead, Tom :-) I would use it.

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
>>>>> "TL" == Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

 >> That shouldn't be too difficult, if we had encoding
 >> information attached to each text column or literal. Maybe now is
 >> the time to introduce NCHAR?

 TL> I've been waiting for a go-ahead from folks who would use
 TL> it. imho the way to do it is to use Postgres' type system to
 TL> implement it, rather than, for example, encoding "type"
 TL> information into each string. We can also define a "default
 TL> encoding" for each database as a new column in pg_database...

What about sorting? Would it be possible to solve it in a similar way?
If I'm not mistaken, there is currently no good way to use two
different kinds of sorting in one postmaster instance?

Milan Zamazal
> What about sorting? Would it be possible to solve it in a similar way?
> If I'm not mistaken, there is currently no good way to use two
> different kinds of sorting in one postmaster instance?

Each encoding/character set can behave however you want. You can reuse
the collation and sorting code from another character set, or define a
unique one.

                      - Thomas

--
Thomas Lockhart    lockhart@alumni.caltech.edu    South Pasadena, California
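[Editorial aside: "reuse or define a unique one" can be pictured as a
registry mapping each character set to a sort-key function. This is a
hypothetical Python sketch - the encoding name "MY_CHARSET" and the
helper names are illustrative, not PostgreSQL identifiers:]

```python
# Sort-key functions playing the role of per-character-set
# collation code.
def bytewise_key(s: str) -> bytes:
    # Plain byte ordering, as a non-locale build would compare.
    return s.encode("utf-8")

def case_insensitive_key(s: str) -> str:
    # A character set that defines its own, case-folding order.
    return s.lower()

# Each character set either reuses existing collation code or
# supplies a unique one.
collations = {
    "SQL_ASCII": bytewise_key,
    "LATIN1": bytewise_key,              # reuses the byte-wise code
    "MY_CHARSET": case_insensitive_key,  # hypothetical custom order
}

def sort_in(encoding: str, values):
    """Sort `values` using the collation registered for `encoding`."""
    return sorted(values, key=collations[encoding])

print(sort_in("SQL_ASCII", ["b", "A", "C"]))   # ['A', 'C', 'b']
print(sort_in("MY_CHARSET", ["b", "A", "C"]))  # ['A', 'b', 'C']
```

Two character sets sharing `bytewise_key` is the "reuse" case; the
custom key is the "define a unique one" case.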
Thomas Lockhart wrote:
>
> > What about sorting? Would it be possible to solve it in a similar way?
> > If I'm not mistaken, there is currently no good way to use two
> > different kinds of sorting in one postmaster instance?
>
> Each encoding/character set can behave however you want. You can reuse
> the collation and sorting code from another character set, or define a
> unique one.

Is that really inside one postmaster instance? If so, is the character
encoding defined at create table / create index time (maybe even
separately for each field?), or can I specify it when sorting?

-----------------
Hannu
> > Each encoding/character set can behave however you want. You can reuse
> > the collation and sorting code from another character set, or define a
> > unique one.
> Is that really inside one postmaster instance? If so, is the character
> encoding defined at create table / create index time (maybe even
> separately for each field?), or can I specify it when sorting?

Yes, yes, and yes ;)

I would propose that we implement the explicit collation features of
SQL92 using implicit type conversion. So if you want to use a
different sorting order on a *compatible* character set, then (looking
up in Date and Darwen for the syntax...):

  'test string' COLLATE CASE_INSENSITIVITY

becomes internally

  case_insensitivity('test string'::text)

and

  c1 < c2 COLLATE CASE_INSENSITIVITY

becomes

  case_insensitivity(c1) < case_insensitivity(c2)

                      - Thomas

--
Thomas Lockhart    lockhart@alumni.caltech.edu    South Pasadena, California
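[Editorial aside: the proposed rewrite - COLLATE as an implicit
function wrapper around each operand - can be mocked up directly. A
Python sketch; `case_insensitivity` here is a stand-in for the
internal function the proposal would generate, not an existing
PostgreSQL function:]

```python
def case_insensitivity(s: str) -> str:
    # Stand-in for the internal case_insensitivity(text) call that
    # 'expr COLLATE CASE_INSENSITIVITY' would expand to.
    return s.casefold()

def collate_lt(c1: str, c2: str, collation) -> bool:
    # c1 < c2 COLLATE X  ==>  X(c1) < X(c2)
    return collation(c1) < collation(c2)

# Under the collation, 'apple' sorts before 'BANANA'...
print(collate_lt("apple", "BANANA", case_insensitivity))  # True
# ...while the raw (byte-wise) comparison says the opposite,
# because 'B' (0x42) precedes 'a' (0x61).
print("apple" < "BANANA")  # False
```

The appeal of the scheme is that the comparison operator itself never
changes; only the operands are rewritten, so the existing type system
does all the work.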
> > Is that really inside one postmaster instance? If so, is the character
> > encoding defined at create table / create index time (maybe even
> > separately for each field?), or can I specify it when sorting?
>
> Yes, yes, and yes ;)

But can we avoid calling strcoll() and the other code surrounded by
#ifdef LOCALE? I think what he actually wants is to define his own
collation *and* not to use locale if the column is ASCII only.

> I would propose that we implement the explicit collation features of
> SQL92 using implicit type conversion. So if you want to use a
> different sorting order on a *compatible* character set, then (looking
> up in Date and Darwen for the syntax...):
>
>   'test string' COLLATE CASE_INSENSITIVITY
>
> becomes internally
>
>   case_insensitivity('test string'::text)
>
> and
>
>   c1 < c2 COLLATE CASE_INSENSITIVITY
>
> becomes
>
>   case_insensitivity(c1) < case_insensitivity(c2)

This idea seems great and elegant. OK, what about throwing away #ifdef
LOCALE? The same thing can be obtained by defining a special collation
LOCALE_AWARE. That seems much more consistent to me. Or even better,
we could explicitly have a predefined COLLATION for each language
(these could be generated automatically from existing locale data).
This would avoid some platform-specific locale problems.
---
Tatsuo Ishii
> But can we avoid calling strcoll() and the other code surrounded by
> #ifdef LOCALE? I think what he actually wants is to define his own
> collation *and* not to use locale if the column is ASCII only.

Right. But there would be a fundamental character type which is *not*
locale-aware, and there is another type (perhaps/probably NCHAR?)
which is...

> OK, what about throwing away #ifdef LOCALE? The same thing can be
> obtained by defining a special collation LOCALE_AWARE.

Or moving the locale-aware stuff to a formal NCHAR implementation.
istm (and to Date and Darwen ;) that there is a tighter relationship
between collations, character repertoires, and character sets than
might be inferred from the SQL92-defined capabilities.

> That seems much more consistent to me. Or even better, we could
> explicitly have a predefined COLLATION for each language (these could
> be generated automatically from existing locale data). This would
> avoid some platform-specific locale problems.

Right. We may already have some of this with the "implicit type
coercion" conventions I introduced in the v6.4 release.

                      - Thomas

--
Thomas Lockhart    lockhart@alumni.caltech.edu    South Pasadena, California