Thread: Multibyte in autoconf
This belongs to the chapter "initdb weirdnesses", if you will. I have long been confused about this, but now I think I have the answer. Could someone from the multibyte camp please confirm this?

When I configure --with-mb=FOO, the only place FOO is actually used is in initdb, as the default encoding of the database system you are creating (it can be overridden with --pgencoding). The rest of the source code only does occasional #ifdef MULTIBYTE checks. This sort of arrangement is questionable for a number of reasons:

1) It's not very clear to the casual observer (=end user). I was led to believe that the database system you are compiling would only support the FOO encoding, and that I would use several --with-mb's if I wanted more.

2) It is quite possible that one initdb instance will be used to install databases in several locations with varying encodings.

3) I might sound like a broken record, but autoconf is not for controlling runtime behavior.

While the notion of having a default encoding is perhaps not so bad (but how often do you do initdb?), it could be introduced via other mechanisms, such as environment variables. (I am contradicting earlier emails now, but I'm not sure of a good way myself, yet.) The current approach causes all kinds of structural hazards in the overall view of things. I propose that --with-mb be replaced by --enable-mb (how about --enable-multibyte?). This is nothing urgent, but I would like to know what you think.

-Peter

--
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
> This belongs to the chapter "initdb weirdnesses", if you will. I have long
> been confused about this, but now I think I have the answer. Could someone
> from the multibyte camp please confirm this.
>
> When I configure --with-mb=FOO, the only place FOO is actually used is in
> initdb as the default encoding of the database system you are creating
> (can be overridden with --pgencoding). The rest of the source code only
> does occasional #ifdef MULTIBYTE checks. This sort of arrangement is
> questionable for a number of reasons:
>
> 1) It's not very clear to the casual observer (=end user). I was led to
> believe that the database system you are compiling will only support the
> FOO encoding and I used several --with-mb's if I wanted more.

Have you ever read doc/README.mb?

> 2) It is very well possible that one initdb instance can be used to
> install databases in several locations with varying encodings.

You can initialize a database with a specified default encoding by initdb -e or -pgencoding. What's the problem with this?

> 3) I might sound like a broken record, but autoconf is not for controlling
> runtime behavior.
>
> While the notion of having a default encoding is perhaps not so bad (but
> how often do you do initdb?) it could be introduced via other mechanisms,
> such as environment variables. (I am contradicting earlier emails now, but
> I'm not sure of a good way myself, yet.) The current approach causes all
> kinds of structural hazards in the overall view of things. I propose that
> --with-mb be replaced by --enable-mb (how about --enable-multibyte?). This
> is nothing urgent, but I would like to know what you think.

I don't understand why you do not complain about --with-pgport or --with-maxbackends. Sounds like they have the same problems as mb :-)

Anyway, I don't like the idea of having yet another environment variable to give a default encoding to initdb when -e or -pgencoding is not specified. We already have enough.

Changing --with-mb to --enable-multibyte seems good, but I don't know how to give the default encoding to initdb in this case. Or is just changing --with-mb=FOO to --enable-multibyte=FOO what you want?

--
Tatsuo Ishii
On Tue, 7 Dec 1999, Tatsuo Ishii wrote:

> > 1) It's not very clear to the casual observer (=end user). I was led to
> > believe that the database system you are compiling will only support the
> > FOO encoding and I used several --with-mb's if I wanted more.
>
> Have you ever read doc/README.mb?

Yes, and although it is nice, it didn't make this particular part easier to figure out. I mean, if I configure the compilation of a program with --with-something=foo, then I assume it actually uses "foo" somehow. And then I see my compilation actually full of -DMULTIBYTE=XXX lines, confusing me further.

Btw., why is this not in the main documentation?

> > 2) It is very well possible that one initdb instance can be used to
> > install databases in several locations with varying encodings.
>
> You can initialize a database with a specified default encoding by
> initdb -e or -pgencoding. What's the problem with this?

First and foremost, non-obvious, multi-level meta-defaults. You actually have a default for what initdb chooses as the default encoding. Also, think about package maintainers. Which one are they going to pick?

> I don't understand why you do not complain about --with-pgport or
> --with-maxbackends. Sounds like they have the same problems as mb :-)

Well, I can't complain about everything at once :) Surely, those are more subtle things, though. The pg_ctl you are working on will pretty much eliminate the need for those.

> Anyway, I don't like the idea of having yet another environment
> variable to give a default encoding to initdb when -e or -pgencoding
> is not specified. We already have enough. Changing --with-mb to --

I agree. Considering the fact that in a fairly normal environment you only initdb once and you only configure once, would it be too far-fetched to propose moving this sort of decision completely into initdb, that is, to make --pgencoding mandatory if you do want some encoding? Because I'm also not completely sure how you would initdb a database without any encoding whatsoever if you have your initdb set to always use some default.

To be clear: this is not the end of the world. If you think this will be a major pain for you, then I'll drop it. I just want to know what the motivation behind this was.

--
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
> > Have you ever read doc/README.mb?
>
> Yes, and although it is nice, it didn't make this particular part easier
> to figure out. I mean, if I configure the compilation of a program with
> --with-something=foo, then I assume it actually uses "foo" somehow. And
> then I see my compilation actually full of -DMULTIBYTE=XXX lines,
> confusing me further.

I must admit that I'm not good at English and writing :-)

> Btw., why is this not in the main documentation?

OK, I will do it for 7.0. Please give me some ideas for enhancing README.mb (you already gave one) if you have any.

> > Anyway, I don't like the idea of having yet another environment
> > variable to give a default encoding to initdb when -e or -pgencoding
> > is not specified. We already have enough. Changing --with-mb to --
>
> I agree. Considering the fact that in a fairly normal environment you only
> initdb once and you only configure once, would it be too far-fetched to
> propose moving this sort of decision completely into initdb, that is, make
> the --pgencoding mandatory if you do want some encoding? Because I'm also
> not completely sure how you would initdb a database without any encoding
> whatsoever if you have your initdb set to always use some default.

I think I see your point. Giving a default-default encoding to initdb is not a good idea, right? If so, that sounds reasonable to me too.

--
Tatsuo Ishii
> > Btw., why is this not in the main documentation?
> OK, I will do it for 7.0. Please give me some ideas for enhancing
> README.mb (you already gave one) if you have any.

If this is on the same scale as the basic locale support, it might fit into doc/src/sgml/config.sgml (to appear in the chapter on Configuration Options in the Admin's Guide). Or, if you want, put it into a separate file (e.g. multibyte.sgml), as either plain text or with some markup, and I'll help to integrate it.

Also, I'll be happy to help edit and adjust language, so don't worry about the translation details ;)

                     - Thomas

--
Thomas Lockhart                   lockhart@alumni.caltech.edu
South Pasadena, California
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I agree. Considering the fact that in a fairly normal environment you only
>> initdb once and you only configure once, would it be too far-fetched to
>> propose moving this sort of decision completely into initdb, that is, make
>> the --pgencoding mandatory if you do want some encoding? Because I'm also
>> not completely sure how you would initdb a database without any encoding
>> whatsoever if you have your initdb set to always use some default.

> I think I see your point. Giving a default-default encoding to initdb
> is not a good idea, right? If so, that sounds reasonable to me too.

OK, so the proposal is

configure: --enable-mb
    Enables compilation of MULTIBYTE code, does not select a default

initdb: --pgencoding=FOO
    Establishes coding of database; it's an error to specify non-default
    encoding if MULTIBYTE wasn't compiled.
    If no --pgencoding, you get default (non-multibyte) coding even
    if you compiled with --enable-mb.

Seems reasonable and flexible to me.

			regards, tom lane
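The rules in this proposal can be sketched as a small decision function. This is only an illustration in Python, not initdb's actual code: the function name `resolve_encoding`, the `multibyte_compiled` flag, and the use of SQL_ASCII as the name of the non-multibyte default are all assumptions made for the example.

```python
from typing import Optional

# Assumed name for the default (non-multibyte) coding; hypothetical here.
DEFAULT_ENCODING = "SQL_ASCII"


def resolve_encoding(multibyte_compiled: bool, pgencoding: Optional[str]) -> str:
    """Model of the proposed initdb behaviour (illustrative only).

    multibyte_compiled -- True if the build was configured with --enable-mb
    pgencoding         -- value of --pgencoding, or None if the switch is absent
    """
    if pgencoding is None or pgencoding == DEFAULT_ENCODING:
        # No switch given (or the default named explicitly): always the
        # non-multibyte default, even on a build with --enable-mb.
        return DEFAULT_ENCODING
    if not multibyte_compiled:
        # A non-default encoding is an error without MULTIBYTE support.
        raise ValueError(
            f"encoding {pgencoding!r} requires a build configured with --enable-mb"
        )
    return pgencoding
```

Under these assumptions, `resolve_encoding(True, None)` yields the non-multibyte default, `resolve_encoding(True, "EUC_JP")` yields EUC_JP, and `resolve_encoding(False, "EUC_JP")` raises an error.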
> OK, so the proposal is
>
> configure: --enable-mb
>     Enables compilation of MULTIBYTE code, does not select a default

Agreed.

> initdb: --pgencoding=FOO
>     Establishes coding of database; it's an error to specify non-default
>     encoding if MULTIBYTE wasn't compiled.

Agreed.

>     If no --pgencoding, you get default (non-multibyte) coding even
>     if you compiled with --enable-mb.

Not agreed. I think it would be better to give an error if no default encoding is specified when configured with --enable-mb. Reasons:

1) Users tend to use only one encoding rather than switching among multiple-encoding databases. Thus the major encoding for the user should be properly set as the default.

2) If a non-multibyte encoding such as SQL_ASCII is accidentally set as the default, and a multibyte user creates a database with no encoding argument, the result would be a disaster.

--
Tatsuo Ishii
> encoding databases. Thus the major encoding for the user should be properly
> set as the default.
>
> 2) If a non-multibyte encoding such as SQL_ASCII is accidentally set as the
> default, and a multibyte user creates a database with no encoding
> argument, the result would be a disaster.

Tatsuo, glad you are handling the multi-byte issues. Most of us are clueless about it.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> If no --pgencoding, you get default (non-multibyte) coding even
>> if you compiled with --enable-mb.

> Not agreed. I think it would be better to give an error if no default
> encoding is specified when configured with --enable-mb.

OK, I could live with that too. I think Peter's main point is that there's no good reason to select a particular encoding at configure time, even just as a "default". It'll be less confusing if initdb time is the *only* time where you specify the particular MULTIBYTE encoding you want.

			regards, tom lane
On Wed, 8 Dec 1999, Tatsuo Ishii wrote:

> > If no --pgencoding, you get default (non-multibyte) coding even
> > if you compiled with --enable-mb.
>
> Not agreed. I think it would be better to give an error if no default
> encoding is specified when configured with --enable-mb. Reasons:
>
> 1) Users tend to use only one encoding rather than switching among
> multiple-encoding databases. Thus the major encoding for the user should
> be properly set as the default.

Users also initdb only once, and that is the time to *choose* what they want. Then and only then. Once they're done with that they'll never have to worry about it again.

> 2) If a non-multibyte encoding such as SQL_ASCII is accidentally set as the
> default, and a multibyte user creates a database with no encoding
> argument, the result would be a disaster.

Huh, so if I compile my database with multibyte and then I choose to not have a default encoding in template1, but maybe I want to have the multibyte option available for some other database later on, that will be a disaster? Not so good.

What I'm also thinking of is the package maintainer. They should be able to provide a "neutral" yet multibyte (and locale, and cyrillic) enabled package, and one should be able to use that even if one doesn't want to use the multibyte features right now or at all.

Also, it should not be initdb's job to verify that the encodings are correct, supported, etc. The backend should find that out itself. That eliminates duplication of the same logic, which the backend can do better anyway.

--
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
> > > If no --pgencoding, you get default (non-multibyte) coding even
> > > if you compiled with --enable-mb.
> >
> > Not agreed. I think it would be better to give an error if no default
> > encoding is specified when configured with --enable-mb. Reasons:
> >
> > 1) Users tend to use only one encoding rather than switching among
> > multiple-encoding databases. Thus the major encoding for the user should
> > be properly set as the default.
>
> Users also initdb only once, and that is the time to *choose* what they
> want. Then and only then. Once they're done with that they'll never have
> to worry about it again.
>
> > 2) If a non-multibyte encoding such as SQL_ASCII is accidentally set as
> > the default, and a multibyte user creates a database with no encoding
> > argument, the result would be a disaster.
>
> Huh, so if I compile my database with multibyte and then I choose
> to not have a default encoding in template1 but maybe I want to have the
> multibyte option available for some other database later on, that will be
> a disaster? Not so good.

First of all, it's not possible not to have a default encoding in template1. Probably you mean that you choose SQL_ASCII (encoding no. 0) as the default encoding. Anyway, I'm going to give an example scenario of the disaster.

1) initdb with no encoding argument (suppose that SQL_ASCII is set as the default encoding in template1)

2) A user creates a database with no encoding argument. He thought that the default encoding was EUC_JP.

3) He makes a table, then fills it with some Japanese data.

4) Later he pulls data from the table and finds that it is no longer Japanese!

> What I'm also thinking of is the package maintainer. They should be
> able to provide a "neutral" yet multibyte (and locale, and cyrillic)
> enabled package, and one should be able to use that even if one doesn't
> want to use the multibyte features right now or at all.

So you think a postgres package with the multibyte/locale/cyrillic options enabled is a good thing for everyone? At least I don't like the locale option. It is not only useless for multibyte languages such as Japanese, but it also makes text comparison slow. I wouldn't say locale is useless for everyone, however. I admit it is useful for single-byte encodings.

I think it would be very hard to make a unified ideal package for everyone.

> Also, it should not be initdb's job to verify that the encodings are
> correct, supported, etc. The backend should find that out itself. That
> eliminates duplication of the same logic, which the backend can do better
> anyway.

Actually, that duplication can be eliminated by using the same code. I think the pg_id command will do the job.

BTW, I don't think the current implementation of multibyte is complete yet. The next target would be NATIONAL CHARACTER support (not sure it's for 7.0, though). I would like to find a solution for the problem of locale I stated above.

--
Tatsuo Ishii
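The thread does not say how the data in the scenario above stops being Japanese. One plausible mechanism, shown here purely as an assumed illustration in Python and not as a description of the backend, is that code unaware of the encoding cuts or compares strings byte by byte and so splits a two-byte EUC_JP character in half. The helper names `truncate_bytes` and `truncate_chars` are hypothetical.

```python
# Illustration only: an assumed way multibyte text can be damaged when it
# is handled as plain single-byte data.

text = "日本語"                   # three Japanese characters
raw = text.encode("euc_jp")      # EUC_JP uses two bytes for each of them


def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Cut at a byte count, as encoding-unaware code would."""
    return data[:limit]


def truncate_chars(data: bytes, limit: int, encoding: str) -> bytes:
    """Cut at a character count, respecting the encoding."""
    return data.decode(encoding)[:limit].encode(encoding)


damaged = truncate_bytes(raw, 3)            # stops inside the second character
intact = truncate_chars(raw, 2, "euc_jp")   # stops on a character boundary

print(len(raw))                  # 6 bytes for 3 characters
print(intact.decode("euc_jp"))   # the first two characters, still valid

try:
    damaged.decode("euc_jp")
except UnicodeDecodeError as exc:
    # The byte-wise cut left half a character behind.
    print("damaged data is not valid EUC_JP:", exc.reason)
```

The point of the sketch is only that a byte-oriented default and multibyte data do not mix safely, which is why the choice of default encoding matters to multibyte users.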
> BTW, I don't think the current implementation of multibyte is complete
> yet. The next target would be NATIONAL CHARACTER support (not sure
> it's for 7.0, though).

I'm still here, interested in working on NATIONAL CHAR and other character stuff. Will need a multibyte partner though, since I'm not familiar with all of the issues...

                     - Thomas

--
Thomas Lockhart                   lockhart@alumni.caltech.edu
South Pasadena, California
On 1999-12-08, Tatsuo Ishii mentioned:

> 1) initdb with no encoding argument (suppose that SQL_ASCII is set as
> the default encoding in template1)
>
> 2) A user creates a database with no encoding argument. He thought
> that the default encoding was EUC_JP.

Why would the user think that? Can't he check if he's not sure? Call his db admin? Or did the db admin mess up the initdb?

> 3) He makes a table, then fills it with some Japanese data.
>
> 4) Later he pulls data from the table and finds that it is no longer
> Japanese!

That really doesn't have anything to do with what I'm getting at. This is just a naive user, quite honestly.

> So you think a postgres package with the multibyte/locale/cyrillic options
> enabled is a good thing for everyone? At least I don't like the locale
> option. It is not only useless for multibyte languages such as
> Japanese, but it also makes text comparison slow. I wouldn't say locale
> is useless for everyone, however. I admit it is useful for single-byte
> encodings.

(Locale doesn't only affect language matters, but also currency formatting, number display, etc.) The performance problems with locale are a deficiency which will get fixed. But that doesn't mean we have to block this path via other means.

But that was not the point. The point was that what we have here is a default for a default. And moreover a default for an action you only do once. If you init a database system, you make then and there (and only there) a decision about what you are going to do, tell your users about it, and everyone is happy. That's not any more complicated than it is now, only that it moves runtime behaviour to run time programs and leaves build time decisions with configure time programs.

Now you would do:

    ./configure --with-mb=FOO
    make
    make install
    initdb

With the proposal you could do:

    ./configure --enable-multibyte
    make
    make install
    initdb -E FOO   # if you want multibyte in all your databases
    --or--
    initdb          # if you don't want multibyte by default but want
                    # to keep the option for individual cases

The fact that you have configured with --enable-multibyte doesn't mean you have to use it. Just because a program is locale capable doesn't mean you have to decide on the default locale at compile time.

> I think it would be very hard to make a unified ideal package for
> everyone.

That's what packages try to achieve. We shouldn't make it harder for them.

--
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
Peter Eisentraut wrote:

> On 1999-12-08, Tatsuo Ishii mentioned:
>
> > 1) initdb with no encoding argument (suppose that SQL_ASCII is set as
> > the default encoding in template1)
> >
> > 2) A user creates a database with no encoding argument. He thought
> > that the default encoding was EUC_JP.
>
> Why would the user think that? Can't he check if he's not sure? Call his
> db admin? Or did the db admin mess up the initdb?
>

As a Japanese, I don't want to specify an encoding for every initdb. There are few selections except --with-mb=EUC_JP in Japan.

Isn't it preferable that PostgreSQL doesn't need an excellent db admin?

I do initdb frequently in the current tree.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Peter Eisentraut <peter_e@gmx.net> writes:
> The fact that you have configured with --enable-multibyte doesn't mean you
> have to use it. Just because a program is locale capable, doesn't mean you
> have to decide on the default locale at compile time.

Well, if you don't determine a default locale at configure/compile time, what that *really* means is that the default was hardwired in even earlier, ie, when the program was written. (Or else it means that there is no default: if we did that, users would be required to explicitly give an encoding choice whenever they run initdb.)

Seems to me that Tatsuo is right that setting a site-specific default encoding at configure time is handy, and *also* that Peter is right that the encoding should be selectable at initdb time. But where's the conflict? We can accept "--with-mb=FOO" at configure time, with the understanding that the *only* thing FOO is used for is to set the default value of initdb's --pgencoding switch. You override FOO by giving an explicit --pgencoding switch when you do initdb.

People building generic multibyte-capable RPMs would probably configure with FOO=ASCII (or whatever the non-multibyte encoding is called).

Seems like that should satisfy everyone. Have I missed something?

			regards, tom lane
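The precedence described in this compromise is simple enough to write down. The following Python sketch is only a model of it, with assumed names: `initdb_encoding` and its parameters are hypothetical, and SQL_ASCII stands in for "whatever the non-multibyte encoding is called".

```python
from typing import Optional


def initdb_encoding(configure_default: str, pgencoding: Optional[str] = None) -> str:
    """Model of the compromise: FOO from --with-mb=FOO is used only as the
    default for --pgencoding, and an explicit --pgencoding overrides it."""
    if pgencoding is not None:
        return pgencoding
    return configure_default


# A site built with --with-mb=EUC_JP never has to name the encoding.
print(initdb_encoding("EUC_JP"))

# A generic package built with the non-multibyte default still allows
# a multibyte encoding to be chosen at initdb time.
print(initdb_encoding("SQL_ASCII"))
print(initdb_encoding("SQL_ASCII", "EUC_JP"))
```

In this model the configure-time value never limits which encodings are available; it only decides what happens when the switch is omitted.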
> > BTW, I don't think the current implementation of multibyte is complete
> > yet. The next target would be NATIONAL CHARACTER support (not sure
> > it's for 7.0, though).
>
> I'm still here, interested in working on NATIONAL CHAR and other
> character stuff. Will need a multibyte partner though, since I'm not
> familiar with all of the issues...

Thanks for the offer. I would need help with the parser and some other stuff too.

--
Tatsuo Ishii
On 1999-12-09, Hiroshi Inoue mentioned:

> As a Japanese, I don't want to specify an encoding for every initdb.
> There are few selections except --with-mb=EUC_JP in Japan.

As I mentioned frequently, in a normal environment, you configure as many times as you initdb.

> Isn't it preferable that PostgreSQL doesn't need an excellent db
> admin?
>
> I do initdb frequently in the current tree.

But you configure and build each time, right?

Anyway, as one who is only getting into multibyte for toyage, I guess I will drop this topic for now.

--
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden