Thread: Multibyte in autoconf

Multibyte in autoconf

From
Peter Eisentraut
Date:
This belongs to the chapter "initdb weirdnesses", if you will. I have long
been confused about this, but now I think I have the answer. Could someone
from the multibyte camp please confirm this.

When I configure --with-mb=FOO, the only place FOO is actually used is in
initdb as the default encoding of the database system you are creating
(can be overridden with --pgencoding). The rest of the source code only
does occasional #ifdef MULTIBYTE checks. This sort of arrangement is
questionable for a number of reasons:

1) It's not very clear to the casual observer (=end user). I was led to
believe that the database system you are compiling will only support the
FOO encoding and that I should use several --with-mb's if I wanted more.

2) It is very well possible that one initdb instance can be used to
install databases in several locations with varying encodings.

3) I might sound like a broken record, but autoconf is not for controlling
runtime behavior.

While the notion of having a default encoding is perhaps not so bad (but
how often do you do initdb?) it could be introduced via other mechanisms,
such as environment variables. (I am contradicting earlier emails now, but
I'm not sure of a good way myself, yet.) The current approach causes all
kinds of structural hazards in the overall view of things. I propose that
--with-mb be replaced by --enable-mb (how about --enable-multibyte?). This
is nothing urgent, but I would like to know what you think.
-Peter


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden




Re: [HACKERS] Multibyte in autoconf

From
Tatsuo Ishii
Date:
> This belongs to the chapter "initdb weirdnesses", if you will. I have long
> been confused about this, but now I think I have the answer. Could someone
> from the multibyte camp please confirm this.
> 
> When I configure --with-mb=FOO, the only place FOO is actually used is in
> initdb as the default encoding of the database system you are creating
> (can be overridden with --pgencoding). The rest of the source code only
> does occasional #ifdef MULTIBYTE checks. This sort of arrangement is
> questionable for a number of reasons:
> 
> 1) It's not very clear to the casual observer (=end user). I was led to
> believe that the database system you are compiling will only support the
> FOO encoding and that I should use several --with-mb's if I wanted more.

Have you ever read doc/README.mb?

> 2) It is very well possible that one initdb instance can be used to
> install databases in several locations with varying encodings.

You can initialize a database with a specified default encoding using
initdb -e or -pgencoding. What's the problem with this?

> 3) I might sound like a broken record, but autoconf is not for controlling
> runtime behavior.
> 
> While the notion of having a default encoding is perhaps not so bad (but
> how often do you do initdb?) it could be introduced via other mechanisms,
> such as environment variables. (I am contradicting earlier emails now, but
> I'm not sure of a good way myself, yet.) The current approach causes all
> kinds of structural hazards in the overall view of things. I propose that
> --with-mb be replaced by --enable-mb (how about --enable-multibyte?). This
> is nothing urgent, but I would like to know what you think.

I don't understand why you do not complain about --with-pgport or
--with-maxbackends. It sounds like they have the same problems as mb :-)

Anyway, I don't like the idea of having yet another environment
variable to give a default encoding to initdb when -e or -pgencoding
is not specified. We already have enough. Changing --with-mb to
--enable-multibyte seems good, but I don't know how to give the default
encoding to initdb in this case. Or is just changing --with-mb=FOO to
--enable-multibyte=FOO what you want?
--
Tatsuo Ishii


Re: [HACKERS] Multibyte in autoconf

From
Peter Eisentraut
Date:
On Tue, 7 Dec 1999, Tatsuo Ishii wrote:

> > 1) It's not very clear to the casual observer (=end user). I was led to
> > believe that the database system you are compiling will only support the
> > FOO encoding and that I should use several --with-mb's if I wanted more.
> 
> Have you ever read doc/README.mb?

Yes, and although it is nice, it didn't make this particular part easier
to figure out. I mean, if I configure the compilation of a program with
--with-something=foo, then I assume it actually uses "foo" somehow. And
then I see my compilation actually full of -DMULTIBYTE=XXX lines,
confusing me further.

Btw., why is this not in the main documentation?

> > 2) It is very well possible that one initdb instance can be used to
> > install databases in several locations with varying encodings.
> 
> You can initialize a database with a specified default encoding using
> initdb -e or -pgencoding. What's the problem with this?

First and foremost, non-obvious, multi-level meta-defaults. You actually
have a default for what initdb chooses as the default encoding. Also,
think about package maintainers. Which one are they going to pick?

> I don't understand why you do not complain about --with-pgport or
> --with-maxbackends. It sounds like they have the same problems as mb :-)

Well, I can't complain about everything at once :) Surely, those are more
subtle things, though. The pg_ctl you are working on will pretty much
eliminate the need for those.

> Anyway, I don't like the idea of having yet another environment
> variable to give a default encoding to initdb when -e or -pgencoding
> is not specified. We already have enough. Changing --with-mb to --

I agree. Considering the fact that in a fairly normal environment you only
initdb once and you only configure once, would it be too far-fetched to
propose moving this sort of decision completely into initdb, that is, make
the --pgencoding mandatory if you do want some encoding? Because I'm also
not completely sure how you would initdb a database without any encoding
whatsoever if you have your initdb set to always use some default.

To be clear: This is not the end of the world, if you think this will be a
major pain for you, then I'll drop it. I just want to know what the
motivation behind this was.

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] Multibyte in autoconf

From
Tatsuo Ishii
Date:
> > Have you ever read doc/README.mb?
> 
> Yes, and although it is nice, it didn't make this particular part easier
> to figure out. I mean, if I configure the compilation of a program with
> --with-something=foo, then I assume it actually uses "foo" somehow. And
> then I see my compilation actually full of -DMULTIBYTE=XXX lines,
> confusing me further.

I must admit that I'm not good at English and writing:-)

> Btw., why is this not in the main documentation?

Ok, I will do it for 7.0. Please give me some ideas for improving
README.mb (you already gave one) if you have any.

> > Anyway, I don't like the idea of having yet another environment
> > variable to give a default encoding to initdb when -e or -pgencoding
> > is not specified. We already have enough. Changing --with-mb to --
> 
> I agree. Considering the fact that in a fairly normal environment you only
> initdb once and you only configure once, would it be too far-fetched to
> propose moving this sort of decision completely into initdb, that is, make
> the --pgencoding mandatory if you do want some encoding? Because I'm also
> not completely sure how you would initdb a database without any encoding
> whatsoever if you have your initdb set to always use some default.

I think I see your point. Giving a default-default encoding to initdb
is not a good idea, right?  If so, that sounds reasonable to me
too.
--
Tatsuo Ishii


Re: [HACKERS] Multibyte in autoconf

From
Thomas Lockhart
Date:
> > Btw., why is this not in the main documentation?
> Ok, I will do it for 7.0. Please give me some ideas for improving
> README.mb (you already gave one) if you have any.

If this is on the same scale as the basic locale support, it might fit
into doc/src/sgml/config.sgml (to appear in the chapter on
Configuration Options in the Admin's Guide). Or, if you want, put it
into a separate file (e.g. multibyte.sgml) as either plain text or
with some markup, and I'll help to integrate it.

Also, I'll be happy to help edit and adjust language, so don't worry
about the translation details ;)
                   - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Multibyte in autoconf

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I agree. Considering the fact that in a fairly normal environment you only
>> initdb once and you only configure once, would it be too far-fetched to
>> propose moving this sort of decision completely into initdb, that is, make
>> the --pgencoding mandatory if you do want some encoding? Because I'm also
>> not completely sure how you would initdb a database without any encoding
>> whatsoever if you have your initdb set to always use some default.

> I think I see your point. Giving a default-default encoding to initdb
> is not a good idea, right?  If so, that sounds reasonable to me
> too.

OK, so the proposal is

configure: --enable-mb
    Enables compilation of MULTIBYTE code, does not select a default

initdb: --pgencoding=FOO
    Establishes coding of database; it's an error to specify
    non-default encoding if MULTIBYTE wasn't compiled.

    If no --pgencoding, you get default (non-multibyte) coding even
    if you compiled with --enable-mb.

Seems reasonable and flexible to me.
        regards, tom lane
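
The rules proposed above can be sketched as a small shell function. This
is only an illustration of the decision logic, not initdb's actual code:
the function name choose_encoding and the variable MULTIBYTE_COMPILED are
hypothetical, and SQL_ASCII is assumed to be the name of the default
(non-multibyte) coding, as it is called later in the thread.

```shell
#!/bin/sh
# Hypothetical sketch of the proposed rules; not taken from initdb.
# MULTIBYTE_COMPILED=yes stands in for a build configured with --enable-mb.
MULTIBYTE_COMPILED=${MULTIBYTE_COMPILED:-no}
DEFAULT_ENCODING=SQL_ASCII   # assumed name of the non-multibyte default

# choose_encoding [ENCODING]
# Prints the encoding the new database system would get.
# Returns 1 if a non-default encoding is requested without MULTIBYTE.
choose_encoding() {
    requested=${1:-}
    if [ -z "$requested" ]; then
        # No --pgencoding given: default coding, even in a multibyte build.
        echo "$DEFAULT_ENCODING"
        return 0
    fi
    if [ "$requested" != "$DEFAULT_ENCODING" ] && [ "$MULTIBYTE_COMPILED" != yes ]; then
        echo "error: encoding $requested needs a build with MULTIBYTE" >&2
        return 1
    fi
    echo "$requested"
}
```

With MULTIBYTE_COMPILED=yes, calling choose_encoding EUC_JP prints EUC_JP
and calling it with no argument prints SQL_ASCII. With
MULTIBYTE_COMPILED=no, choose_encoding EUC_JP fails with an error.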


Re: [HACKERS] Multibyte in autoconf

From
Tatsuo Ishii
Date:
> OK, so the proposal is
> 
> configure: --enable-mb
>     Enables compilation of MULTIBYTE code, does not select a default

Agreed.

> initdb: --pgencoding=FOO
>     Establishes coding of database; it's an error to specify non-
>     default encoding if MULTIBYTE wasn't compiled.

Agreed.

>     If no --pgencoding, you get default (non-multibyte) coding even
>     if you compiled with --enable-mb.

Not agreed. I think it would be better to give an error if a default
encoding is not specified when configured with --enable-mb.  Reasons:

1) Users tend to use only one encoding rather than switching between
databases with multiple encodings. Thus the user's main encoding should
be properly set as the default.

2) If a non-multibyte coding such as SQL_ASCII is accidentally set as the
default, and a multibyte user creates a database with no encoding
argument, the result would be a disaster.
--
Tatsuo Ishii
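
The stricter variant argued for in this message could look roughly like
the sketch below. It is a hedged illustration only: the function name
choose_encoding_strict and the variable MULTIBYTE_COMPILED are
hypothetical and do not come from initdb. The one difference from the
earlier proposal is that a multibyte build refuses to pick an encoding
silently.

```shell
#!/bin/sh
# Hypothetical sketch of the stricter variant; not taken from initdb.
# MULTIBYTE_COMPILED=yes stands in for a build configured with --enable-mb.
MULTIBYTE_COMPILED=${MULTIBYTE_COMPILED:-yes}

# choose_encoding_strict [ENCODING]
# Prints the chosen encoding, or returns 1 with a message on stderr.
choose_encoding_strict() {
    requested=${1:-}
    if [ -n "$requested" ]; then
        if [ "$requested" != SQL_ASCII ] && [ "$MULTIBYTE_COMPILED" != yes ]; then
            echo "error: encoding $requested needs a build with MULTIBYTE" >&2
            return 1
        fi
        echo "$requested"
        return 0
    fi
    if [ "$MULTIBYTE_COMPILED" = yes ]; then
        # Multibyte build and no encoding named: refuse instead of guessing.
        echo "error: multibyte build, an encoding must be given explicitly" >&2
        return 1
    fi
    # Non-multibyte build: only the single-byte default is possible.
    echo SQL_ASCII
}
```

Under this rule an administrator of a multibyte build cannot end up with
SQL_ASCII by accident; the choice has to be spelled out once.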



Re: [HACKERS] Multibyte in autoconf

From
Bruce Momjian
Date:
> databases with multiple encodings. Thus the user's main encoding should
> be properly set as the default.
> 
> 2) If a non-multibyte coding such as SQL_ASCII is accidentally set as the
> default, and a multibyte user creates a database with no encoding
> argument, the result would be a disaster.

Tatsuo, glad you are handling the multi-byte issues.  Most of us are
clueless about it.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Multibyte in autoconf

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> If no --pgencoding, you get default (non-multibyte) coding even
>> if you compiled with --enable-mb.

> Not agreed. I think it would be better to give an error if a default
> encoding is not specified when configured with --enable-mb.

OK, I could live with that too.  I think Peter's main point is that
there's no good reason to select a particular encoding at configure
time, even just as a "default".  It'll be less confusing if initdb time
is the *only* time where you specify the particular MULTIBYTE encoding
you want.
        regards, tom lane


Re: [HACKERS] Multibyte in autoconf

From
Peter Eisentraut
Date:
On Wed, 8 Dec 1999, Tatsuo Ishii wrote:

> >     If no --pgencoding, you get default (non-multibyte) coding even
> >     if you compiled with --enable-mb.
> 
> Not agreed. I think it would be better to give an error if a default
> encoding is not specified when configured with --enable-mb.  Reasons:
> 
> 1) Users tend to use only one encoding rather than switching between
> databases with multiple encodings. Thus the user's main encoding should
> be properly set as the default.

Users also initdb only once, and that is the time to *choose* what they
want. Then and only then. Once they're done with that they'll never have
to worry about it again.

> 2) If a non-multibyte coding such as SQL_ASCII is accidentally set as the
> default, and a multibyte user creates a database with no encoding
> argument, the result would be a disaster.

Huh, so if I compile my database with multibyte and then I choose
to not have a default encoding in template1 but maybe I want to have the
multibyte option available for some other database later on, that will be
a disaster? Not so good.

What I'm also thinking of is the package maintainer. They should be
able to provide a "neutral" yet multibyte (and locale, and cyrillic)
enabled package, and one should be able to use that even if one doesn't
want to use the multibyte features right now or at all.

Also, it should not be initdb's job to verify that the encodings are
correct, supported, etc. The backend should find that out itself. That
eliminates duplication of the same logic, which the backend can do better
anyway.

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] Multibyte in autoconf

From
Tatsuo Ishii
Date:
> > >     If no --pgencoding, you get default (non-multibyte) coding even
> > >     if you compiled with --enable-mb.
> > 
> > Not agreed. I think it would be better to give an error if a default
> > encoding is not specified when configured with --enable-mb.  Reasons:
> > 
> > 1) Users tend to use only one encoding rather than switching between
> > databases with multiple encodings. Thus the user's main encoding should
> > be properly set as the default.
> 
> Users also initdb only once, and that is the time to *choose* what they
> want. Then and only then. Once they're done with that they'll never have
> to worry about it again.
> 
> > 2) If a non-multibyte coding such as SQL_ASCII is accidentally set as the
> > default, and a multibyte user creates a database with no encoding
> > argument, the result would be a disaster.
> 
> Huh, so if I compile my database with multibyte and then I choose
> to not have a default encoding in template1 but maybe I want to have the
> multibyte option available for some other database later on, that will be
> a disaster? Not so good.

First of all, it's not possible not to have a default encoding in
template1. Probably you mean you choose SQL_ASCII (encoding no. is 0)
as the default encoding. Anyway, I'm going to give an example scenario
of the disaster.

1) initdb with no encoding argument (suppose that SQL_ASCII is set as
the default encoding in template1)

2) a user creates a database with no encoding argument. He thought
that the default encoding was EUC_JP.

3) he makes a table then fills it with some Japanese data.

4) later he pulls data from the table and finds that it is no longer
Japanese!

> What I'm also thinking of is the package maintainer. They should be
> able to provide a "neutral" yet multibyte (and locale, and cyrillic)
> enabled package, and one should be able to use that even if one doesn't
> want to use the multibyte features right now or at all.

So you think a postgres package with multibyte/locale/cyrillic options
enabled is a good thing for everyone? At least I don't like the locale
option. It is not only useless for multibyte languages such as
Japanese, but it also makes text comparison slow. I wouldn't say locale
is useless for everyone, however. I admit it is useful for single-byte
encodings.

I think it would be very hard to make a unified ideal package for
everyone.

> Also, it should not be initdb's job to verify that the encodings are
> correct, supported, etc. The backend should find that out itself. That
> eliminates duplication of the same logic, which the backend can do better
> anyway.

Actually that duplication can be eliminated by using the same
code. I think pg_id command will do the job.

BTW, I don't think the current implementation of multibyte is
complete yet.  The next target would be NATIONAL CHARACTER support (not
sure it's for 7.0, though).  I would like to find a solution for the
problem of locale I stated above.
--
Tatsuo Ishii


Re: [HACKERS] Multibyte in autoconf

From
Thomas Lockhart
Date:
> BTW, I don't think the current implementation of multibyte is
> complete yet.  The next target would be NATIONAL CHARACTER support (not
> sure it's for 7.0, though).

I'm still here, interested in working on NATIONAL CHAR and other
character stuff. Will need a multibyte partner though, since I'm not
familiar with all of the issues...
                    - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Multibyte in autoconf

From
Peter Eisentraut
Date:
On 1999-12-08, Tatsuo Ishii mentioned:

> 1) initdb with no encoding argument (suppose that SQL_ASCII is set as
> the default encoding in template1)
> 
> 2) a user creates a database with no encoding argument. He thought
> that the default encoding was EUC_JP.

Why would the user think that? Can't he check if he's not sure? Call his
db admin? Or did the db admin mess up the initdb?

> 
> 3) he makes a table then fills it with some Japanese data.
> 
> 4) later he pulls data from the table and finds that it is no longer
> Japanese!

That really doesn't have anything to do with what I'm getting at. This is
just a naive user, quite honestly.

> So you think a postgres package with multibyte/locale/cyrillic options
> enabled is a good thing for everyone? At least I don't like the locale
> option. It is not only useless for multibyte languages such as
> Japanese, but it also makes text comparison slow. I wouldn't say locale
> is useless for everyone, however. I admit it is useful for single-byte
> encodings.

(Locale doesn't only affect language matters, but also currency
formatting, number display, etc.)

The performance problems with locale are a deficiency which will get fixed.
But that doesn't mean we have to block this path via other means. But that
was not the point. The point was that what we have here is a default for a
default. And moreover a default for an action you only do once. If you
init a database system, you make then and there (and only there) a
decision what you are going to do, tell your users about it and everyone
is happy. That's not any more complicated than it is now, only that it
moves runtime behaviour to run time programs and leaves build time
decisions with configure time programs.

Now you would do:
./configure --with-mb=FOO
make
make install
initdb

With the proposal you could do:
./configure --enable-multibyte
make
make install

initdb -E FOO # if you want multibyte in all your databases
--or--
initdb        # if you don't want multibyte by default but want
              # to keep the option for individual cases

The fact that you have configured with --enable-multibyte doesn't mean you
have to use it. Just because a program is locale capable, doesn't mean you
have to decide on the default locale at compile time.

> I think it would be very hard to make a unified ideal package for
> everyone.

That's what packages try to achieve. We shouldn't make it harder for them.

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden





Re: [HACKERS] Multibyte in autoconf

From
Hiroshi Inoue
Date:

Peter Eisentraut wrote:

> On 1999-12-08, Tatsuo Ishii mentioned:
>
> > 1) initdb with no encoding argument (suppose that SQL_ASCII is set as
> > the default encoding in template1)
> >
> > 2) a user creates a database with no encoding argument. He thought
> > that the default encoding was EUC_JP.
>
> Why would the user think that? Can't he check if he's not sure? Call his
> db admin? Or did the db admin mess up the initdb?
>

As a Japanese user, I don't want to specify an encoding for every initdb.
There are few choices other than --with-mb=EUC_JP in Japan.
Isn't it preferable that PostgreSQL doesn't need an excellent db
admin ?

I do initdb frequently in current tree.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] Multibyte in autoconf

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> The fact that you have configured with --enable-multibyte doesn't mean you
> have to use it. Just because a program is locale capable, doesn't mean you
> have to decide on the default locale at compile time.

Well, if you don't determine a default locale at configure/compile time,
what that *really* means is that the default was hardwired in even
earlier, ie, when the program was written.  (Or else it means that there
is no default: if we did that, users would be required to explicitly
give an encoding choice whenever they run initdb.)

Seems to me that Tatsuo is right that setting a site-specific default
encoding at configure time is handy, and *also* that Peter is right that
the encoding should be selectable at initdb time.  But where's the
conflict?  We can accept "--with-mb=FOO" at configure time, with the
understanding that the *only* thing FOO is used for is to set the
default value of initdb's --pgencoding switch.  You override FOO by
giving an explicit --pgencoding switch when you do initdb.  People
building generic multibyte-capable RPMs would probably configure with
FOO=ASCII (or whatever the non-multibyte encoding is called).  Seems
like that should satisfy everyone.  Have I missed something?
        regards, tom lane
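
The compromise described above amounts to a single fallback: an explicit
switch wins, otherwise the configure-time value is used. A minimal sketch
follows, assuming a hypothetical function resolve_encoding and a
hypothetical variable CONFIGURED_DEFAULT that stands in for FOO from
configure --with-mb=FOO. SQL_ASCII is assumed here as the non-multibyte
value a generic package would configure.

```shell
#!/bin/sh
# Hypothetical sketch; not taken from initdb.
# CONFIGURED_DEFAULT stands in for FOO from "configure --with-mb=FOO".
CONFIGURED_DEFAULT=${CONFIGURED_DEFAULT:-SQL_ASCII}

# resolve_encoding [ENCODING]
# Prints ENCODING if one was given explicitly (the --pgencoding case),
# otherwise the default fixed at configure time.
resolve_encoding() {
    echo "${1:-$CONFIGURED_DEFAULT}"
}
```

A site built with CONFIGURED_DEFAULT=EUC_JP gets EUC_JP from a plain
resolve_encoding call, while resolve_encoding SQL_ASCII still overrides
it for one particular installation.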


Re: [HACKERS] Multibyte in autoconf

From
Tatsuo Ishii
Date:
> > BTW, I don't think the current implementation of multibyte is
> > complete yet.  The next target would be NATIONAL CHARACTER support (not
> > sure it's for 7.0, though).
> 
> I'm still here, interested in working on NATIONAL CHAR and other
> character stuff. Will need a multibyte partner though, since I'm not
> familiar with all of the issues...

Thanks for the offer. I would need help with the parser and some
other stuff too.
--
Tatsuo Ishii


Re: [HACKERS] Multibyte in autoconf

From
Peter Eisentraut
Date:
On 1999-12-09, Hiroshi Inoue mentioned:

> As a Japanese user, I don't want to specify an encoding for every initdb.
> There are few choices other than --with-mb=EUC_JP in Japan.

As I mentioned frequently, in a normal environment, you configure as many
times as you initdb.

> Isn't it preferable that PostgreSQL doesn't need an excellent db
> admin ?
> 
> I do initdb frequently in current tree.

But you configure and build each time, right?

Anyway, as one who is only getting into multibyte for toyage, I guess I
will drop this topic for now.

-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden