Thread: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tom Lane
Date:
We now have defenses against running a non-LOCALE-enabled backend in a
database that was created in non-C locale.  Shouldn't we likewise
prevent a non-MULTIBYTE-enabled backend from running in a database with
a multibyte encoding that's not SQL_ASCII?  Or am I missing a reason why
that is safe?

I propose the following addition to ReverifyMyDatabase in postinit.c:
  #ifdef MULTIBYTE
      SetDatabaseEncoding(dbform->encoding);
+ #else
+    if (dbform->encoding != SQL_ASCII)
+        elog(FATAL, "some suitable error message");
  #endif

Comments?
        regards, tom lane


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Peter Eisentraut
Date:
Tom Lane writes:

> We now have defenses against running a non-LOCALE-enabled backend in a
> database that was created in non-C locale.  Shouldn't we likewise
> prevent a non-MULTIBYTE-enabled backend from running in a database with
> a multibyte encoding that's not SQL_ASCII?  Or am I missing a reason why
> that is safe?

Not all multibyte encodings are actually "multi"-byte, e.g., LATIN2.  In
that case the main benefit is the on-the-fly recoding between the client
and the server.  If a non-MB server encounters that database it should
still work.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/



Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> We now have defenses against running a non-LOCALE-enabled backend in a
>> database that was created in non-C locale.  Shouldn't we likewise
>> prevent a non-MULTIBYTE-enabled backend from running in a database with
>> a multibyte encoding that's not SQL_ASCII?  Or am I missing a reason why
>> that is safe?

> Not all multibyte encodings are actually "multi"-byte, e.g., LATIN2.  In
> that case the main benefit is the on-the-fly recoding between the client
> and the server.  If a non-MB server encounters that database it should
> still work.

Are these encodings all guaranteed to have the same collation order as
SQL_ASCII?  If not, we have the same index corruption issues as for LOCALE.
        regards, tom lane


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tatsuo Ishii
Date:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Tom Lane writes:
> >> We now have defenses against running a non-LOCALE-enabled backend in a
> >> database that was created in non-C locale.  Shouldn't we likewise
> >> prevent a non-MULTIBYTE-enabled backend from running in a database with
> >> a multibyte encoding that's not SQL_ASCII?  Or am I missing a reason why
> >> that is safe?
> 
> > Not all multibyte encodings are actually "multi"-byte, e.g., LATIN2.  In
> > that case the main benefit is the on-the-fly recoding between the client
> > and the server.  If a non-MB server encounters that database it should
> > still work.
> 
> Are these encodings all guaranteed to have the same collation order as
> SQL_ASCII?

Yes & no. 

>If not, we have the same index corruption issues as for LOCALE.

If the backend is configured with LOCALE enabled and the database is
not configured with LOCALE, we will have a problem. But this will
happen with or without MULTIBYTE anyway. Multibyte support has nothing
to do with LOCALE support.
--
Tatsuo Ishii


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tatsuo Ishii
Date:
> We now have defenses against running a non-LOCALE-enabled backend in a
> database that was created in non-C locale.  Shouldn't we likewise
> prevent a non-MULTIBYTE-enabled backend from running in a database with
> a multibyte encoding that's not SQL_ASCII?  Or am I missing a reason why
> that is safe?
> 
> I propose the following addition to ReverifyMyDatabase in postinit.c:
> 
>   #ifdef MULTIBYTE
>       SetDatabaseEncoding(dbform->encoding);
> + #else
> +    if (dbform->encoding != SQL_ASCII)
> +        elog(FATAL, "some suitable error message");
>   #endif
> 
> Comments?

Running a non-MULTIBYTE-enabled backend on a database with a multibyte
encoding other than SQL_ASCII should be safe as long as:

1) read only access
2) the encodings are actually single byte encodings

If a database with a multibyte encoding is updated by a
non-MULTIBYTE-enabled backend, there is a chance that data could be
corrupted, since the backend does not handle multibyte strings correctly.

So I think your suggestion is an improvement.
--
Tatsuo Ishii


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Are these encodings all guaranteed to have the same collation order as
>> SQL_ASCII?

> Yes & no. 

Um, I'm confused ...

>> If not, we have the same index corruption issues as for LOCALE.

> If the backend is configured with LOCALE enabled and the database is
> not configured with LOCALE, we will have a problem. But this will
> happen with or without MULTIBYTE anyway. Multibyte support has nothing
> to do with LOCALE support.

Can a backend configured with MULTIBYTE and running in non-SQL_ASCII
encoding ever sort strings in non-character-code ordering, even if it
is in C locale?  I should think that such behavior is highly likely
for multibyte character sets.

If it can, then we mustn't allow a non-MULTIBYTE backend to run in
such a database, I think.
        regards, tom lane


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tatsuo Ishii
Date:
> >> Are these encodings all guaranteed to have the same collation order as
> >> SQL_ASCII?
> 
> > Yes & no. 
> 
> Um, I'm confused ...
> 
> >> If not, we have the same index corruption issues as for LOCALE.
> 
> > If the backend is configured with LOCALE enabled and the database is
> > not configured with LOCALE, we will have a problem. But this will
> > happen with or without MULTIBYTE anyway. Multibyte support has nothing
> > to do with LOCALE support.
> 
> Can a backend configured with MULTIBYTE and running in non-SQL_ASCII
> encoding ever sort strings in non-character-code ordering, even if it
> is in C locale?  I should think that such behavior is highly likely
> for multibyte character sets.

Hum, I don't think I understand your point because of my English
abilities. I'm going to explain what I want to say in hex
representation, rather than English:-)

Suppose we have four EUC_JP multibyte strings, each consisting of two
bytes (actually they are my name in KANJI characters). They would look
like:

0xc0d0
0xb0e6
0xc3a3
0xd7c9

If we sort these strings using strcmp(), we would get:

0xb0e6
0xc0d0
0xc3a3
0xd7c9

This result might not be perfect, but it is reasonable for most cases,
since the code value of each character in EUC_JP is defined in the hope
that it can be sorted by its physical value.

If we are not satisfied with this result for some reason, we could
add an auxiliary "yomigana" field to get the correct order (Yomigana
is the pronunciation of KANJI).

> If it can, then we mustn't allow a non-MULTIBYTE backend to run in
> such a database, I think.
> 
>             regards, tom lane

Can you explain more about this?
--
Tatsuo Ishii


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> If we sort these strings using strcmp(), we would get:
> ...
> This result might not be perfect, but it is reasonable for most cases,
> since the code value of each character in EUC_JP is defined in the hope
> that it can be sorted by its physical value.

> If we are not satisfied with this result for some reason, we could
> add an auxiliary "yomigana" field to get the correct order (Yomigana
> is the pronunciation of KANJI).

Okay, so if a database has been built by a backend that knows MULTIBYTE
and has some "yomigana" info available, then indexes in text columns
will not be in the same order that strcmp() would put them in, right?

If we then run a non-MULTIBYTE backend in that database, it will see
the indexes as being out of correct order; this will cause indexscans
to miss values they should find, or perhaps fail outright if the code
happens to detect the inconsistency.  If the backend inserts a value
in an index in strcmp() order, the value may be out of place according
to the "yomigana" info, in which case the index is now corrupt from
the point of view of a MULTIBYTE-aware backend as well.

This is essentially the same problem as between LOCALE-aware and
non-LOCALE-aware backends in a database with a non-C locale.

In short, unless you want to enforce a restriction that MULTIBYTE
ordering is always strcmp() order and never anything else, we'd better
disallow non-MULTIBYTE backends in MULTIBYTE databases.
        regards, tom lane


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > If we sort these strings using strcmp(), we would get:
> > ...
> > This result might not be perfect, but it is reasonable for most cases,
> > since the code value of each character in EUC_JP is defined in the hope
> > that it can be sorted by its physical value.
> 
> > If we are not satisfied with this result for some reason, we could
> > add an auxiliary "yomigana" field to get the correct order (Yomigana
> > is the pronunciation of KANJI).
> 
> Okay, so if a database has been built by a backend that knows MULTIBYTE
> and has some "yomigana" info available, then indexes in text columns
> will not be in the same order that strcmp() would put them in, right?

No. The "yomigana" exists in the application world, not in the
database engine itself. What I was talking about was an idea to add
an extra column to a table.

create table t1 (
    kanji       text,    -- KANJI field
    yomigana    text     -- YOMIGANA field
);

The query would be something like:

select kanji from t1 order by yomigana;
--
Tatsuo Ishii


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Okay, so if a database has been built by a backend that knows MULTIBYTE
>> and has some "yomigana" info available, then indexes in text columns
>> will not be in the same order that strcmp() would put them in, right?

> No. The "yomigana" exists in the application world, not in the
> database engine itself. What I was talking about was an idea to add
> an extra column to a table.

Oh, I see.  So the question still remains: can a MULTIBYTE-aware backend
ever use a sort order different from strcmp() order?  (That is, not as
a result of LOCALE, but just because of the non-SQL-ASCII encoding.)

Actually there are more complicated cases that would depend on more
features of the encoding than just sort order.  Consider

    CREATE INDEX fooi ON foo (upper(field1));

Operations involving this index will misbehave if the behavior of
upper() ever differs between MULTIBYTE-aware and non-MULTIBYTE-aware
code.  That seems pretty likely for encodings like LATIN2...
        regards, tom lane


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Peter Eisentraut
Date:
Tom Lane writes:

> Oh, I see.  So the question still remains: can a MULTIBYTE-aware backend
> ever use a sort order different from strcmp() order?  (That is, not as
> a result of LOCALE, but just because of the non-SQL-ASCII encoding.)

According to the code, no, because varstr_cmp() doesn't pay attention to
the multibyte status.  Presumably strcmp() and strcoll() don't either.

> Actually there are more complicated cases that would depend on more
> features of the encoding than just sort order.  Consider
>
>     CREATE INDEX fooi ON foo (upper(field1));
>
> Operations involving this index will misbehave if the behavior of
> upper() ever differs between MULTIBYTE-aware and non-MULTIBYTE-aware
> code.  That seems pretty likely for encodings like LATIN2...

Of course in the most general case this is a problem, because a function
can be implemented totally differently depending on any old #ifdef or
other external factors.

If the multibyte users think this check is okay, then I don't mind, since
it's usually what the users would want anyway.  I'm just pointing out the
technical issues.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/



Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tatsuo Ishii
Date:
> Tom Lane writes:
> 
> > Oh, I see.  So the question still remains: can a MULTIBYTE-aware backend
> > ever use a sort order different from strcmp() order?  (That is, not as
> > a result of LOCALE, but just because of the non-SQL-ASCII encoding.)
> 
> According to the code, no, because varstr_cmp() doesn't pay attention to
> the multibyte status.  Presumably strcmp() and strcoll() don't either.

Right.

> > Actually there are more complicated cases that would depend on more
> > features of the encoding than just sort order.  Consider
> >
> >     CREATE INDEX fooi ON foo (upper(field1));
> >
> > Operations involving this index will misbehave if the behavior of
> > upper() ever differs between MULTIBYTE-aware and non-MULTIBYTE-aware
> > code.  That seems pretty likely for encodings like LATIN2...
> 
> Of course in the most general case this is a problem, because a function
> can be implemented totally differently depending on any old #ifdef or
> other external factors.
> 
> If the multibyte users think this check is okay, then I don't mind, since
> it's usually what the users would want anyway.  I'm just pointing out the
> technical issues.

Right. However, Tom's point is a little bit different, I guess.

As far as I know, most builtin functions taking string data types as
their arguments would behave the same with or without MULTIBYTE.  As far
as I know, the exceptions include:

char_length
quote_ident
quote_literal
ascii
to_ascii

So, for example, 

CREATE INDEX fooi ON foo (char_length(field1));

would behave differently with or without MULTIBYTE if the database
encoding is not a single-byte encoding.
--
Tatsuo Ishii


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Oh, I see.  So the question still remains: can a MULTIBYTE-aware backend
> ever use a sort order different from strcmp() order?  (That is, not as
> a result of LOCALE, but just because of the non-SQL-ASCII encoding.)
>> 
>> According to the code, no, because varstr_cmp() doesn't pay attention to
>> the multibyte status.  Presumably strcmp() and strcoll() don't either.

> Right.

OK, so I guess this comes down to a judgment call: should we insert the
check in the non-MULTIBYTE case, or not?  I still think it's safest to
do so, but I'm not sure what you want to do.
        regards, tom lane


Re: Shouldn't non-MULTIBYTE backend refuse to start in MB database?

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > Oh, I see.  So the question still remains: can a MULTIBYTE-aware backend
> > ever use a sort order different from strcmp() order?  (That is, not as
> > a result of LOCALE, but just because of the non-SQL-ASCII encoding.)
> >> 
> >> According to the code, no, because varstr_cmp() doesn't pay attention to
> >> the multibyte status.  Presumably strcmp() and strcoll() don't either.
> 
> > Right.
> 
> OK, so I guess this comes down to a judgment call: should we insert the
> check in the non-MULTIBYTE case, or not?  I still think it's safest to
> do so, but I'm not sure what you want to do.
> 
>             regards, tom lane

I have discussed this issue with Japanese hackers, including Hiroshi.
We have reached the conclusion that your proposal is appropriate and
will make PostgreSQL more stable.
--
Tatsuo Ishii