Thread: database encoding and collation.

database encoding and collation.

From
"James Pang (chaolpan)"
Date:

   We plan to migrate from Oracle to PostgreSQL 13. The existing Oracle database character set is WE8ISO8859P1, with NLS_SORT = BINARY and NLS_COMP = BINARY. We want to keep the same encoding and collation in PostgreSQL 13 as in Oracle, so encoding = LATIN1 (alias ISO88591). What is the suggested collation and ctype: "C", "POSIX", or en_US.LATIN1? If we use en_US.LATIN1, is there any impact on indexes and on queries with LIKE '%' patterns?

 The OS is Red Hat 8. When NLS_LANG=en_US.LATIN1 is exported, LC_COLLATE and LC_CTYPE are automatically set to en_US.LATIN1, and initdb then creates the database cluster with collname, collation, and ctype all = en_US.LATIN1.


Thanks,


James


Re: database encoding and collation.

From
Laurenz Albe
Date:
On Wed, 2022-01-12 at 13:31 +0000, James Pang (chaolpan) wrote:
> We plan to migrate from Oracle to Postgresql 13, existing Oracle database charaterset WE8ISO8859P1,
> NLS_SORT = BINARY, NLS_COM=BINARY. we want to keep same encoding and collation in Postgresql 13 as Oracle,
> so encoding=LATIN1 (Aliases ISO88591), what's the suggested collation and ctype? Using "C", "POSIX", or en_US.LATIN1?
> if using en_US.LATIN1 any impact to indexes and query where like % ?
> The OS is Redhat 8, when export NLS_LANG=en_US.LATIN1, LC_COLLATE and LC_CTYPE automatically set to en_US.LATIN1,
> and then initdb create database clusters with  collname, collation, ctype all = en_US.LATIN1.

I advise against using any encoding other than UTF-8.
What are you worried about?

For the locale, you can get the same behavior as in Oracle by using "C" or "POSIX"
(they are the same).
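
To illustrate what a "C"/"POSIX" (byte-order) collation means in practice, here is a minimal Python sketch (an illustration, not from the thread): comparing strings by code point gives the same ordering that NLS_SORT=BINARY or a "C" locale produces for single-byte LATIN1 text, with all of A-Z sorting before a-z.

```python
# Python's default string comparison orders by code point, which is what a
# "C"/"POSIX" locale (or Oracle NLS_SORT=BINARY) does for LATIN1 data.
words = ["apple", "Banana", "cherry"]

# Byte-order sort, as a "C" locale would produce ('B' = 0x42 < 'a' = 0x61):
print(sorted(words))                 # ['Banana', 'apple', 'cherry']

# A case-insensitive (more "dictionary-like") sort, for contrast:
print(sorted(words, key=str.lower))  # ['apple', 'Banana', 'cherry']
```

The upshot: a "C" locale is fast and deterministic, but the ordering is linguistic only by accident, even for English.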

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




RE: database encoding and collation.

From
"James Pang (chaolpan)"
Date:
This question is about how to handle multiple languages in the database: English, Japanese, Chinese, and so on. The Oracle database has used WE8ISO8859P1 for a long time. When "Chinese" characters are put into the database, the client application simply uses WE8ISO8859P1 as the client encoding to insert the data, so no conversion happens there. When the client app queries the data, it still gets the data back as WE8ISO8859P1 and then converts it to UTF8 on the client side. Now we plan to migrate to PostgreSQL 13 and want the PostgreSQL database to use WE8ISO8859P1 the same as Oracle, with the same handling.

   Any potential risk?

Thanks,

James
-----Original Message-----
From: Laurenz Albe <laurenz.albe@cybertec.at> 
Sent: Friday, January 14, 2022 3:20 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>; pgsql-admin@lists.postgresql.org
Subject: Re: database encoding and collation.

On Wed, 2022-01-12 at 13:31 +0000, James Pang (chaolpan) wrote:
> We plan to migrate from Oracle to Postgresql 13, existing Oracle 
> database charaterset WE8ISO8859P1 , NLS_SORT = BINARY, NLS_COM=BINARY. 
> we want to keep same encoding and collation in Postgresql 13 as Oracle, so encoding=LATIN1 (Aliases ISO88591),
> what's the suggested collation and ctype? Using "C", "POSIX", or en_US.LATIN1?
> if using en_US.LATIN1 any impact to indexes and query where like % ?
> The OS is Redhat 8, when export NLS_LANG=en_US.LATIN1, LC_COLLATE and 
> LC_CTYPE automatically set to en_US.LATIN1, and then initdb create database clusters with collname,
> collation, ctype all = en_US.LATIN1.

I advise against using any encoding other than UTF-8.
What are you worried about?

For the locale, you can get the same behavior as in Oracle by using "C" or "POSIX"
(they are the same).

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com


RE: database encoding and collation.

From
"James Pang (chaolpan)"
Date:
This question is about how to handle multiple languages in the database: English, Japanese, Chinese, and so on. The Oracle database has used WE8ISO8859P1 for a long time. When "Chinese" characters are put into the database, the client application simply uses WE8ISO8859P1 as the client encoding to insert the data, so no conversion happens there. When the client app queries the data, it still gets the data back as WE8ISO8859P1 and then converts it to UTF8 on the client side. Now we plan to migrate to PostgreSQL 13 and want the PostgreSQL database to use WE8ISO8859P1 the same as Oracle, with the same handling. Any potential risk?

   When using locale "C" or "POSIX", only the A-Z characters get a meaningful binary sort order; for other languages such as French or Chinese, any suggestions?

Thanks,

James
-----Original Message-----
From: Laurenz Albe <laurenz.albe@cybertec.at>
Sent: Friday, January 14, 2022 3:20 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>; pgsql-admin@lists.postgresql.org
Subject: Re: database encoding and collation.

On Wed, 2022-01-12 at 13:31 +0000, James Pang (chaolpan) wrote:
> We plan to migrate from Oracle to Postgresql 13, existing Oracle 
> database charaterset WE8ISO8859P1 , NLS_SORT = BINARY, NLS_COM=BINARY.
> we want to keep same encoding and collation in Postgresql 13 as Oracle, so encoding=LATIN1 (Aliases ISO88591),
> what's the suggested collation and ctype? Using "C", "POSIX", or en_US.LATIN1?
> if using en_US.LATIN1 any impact to indexes and query where like % ?
> The OS is Redhat 8, when export NLS_LANG=en_US.LATIN1, LC_COLLATE and 
> LC_CTYPE automatically set to en_US.LATIN1, and then initdb create database clusters with collname,
> collation, ctype all = en_US.LATIN1.

I advise against using any encoding other than UTF-8.
What are you worried about?

For the locale, you can get the same behavior as in Oracle by using "C" or "POSIX"
(they are the same).

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com


Re: database encoding and collation.

From
Ron
Date:
Try UTF-8 in a little test application, and see what happens.

On 1/13/22 8:11 PM, James Pang (chaolpan) wrote:
    This question is about how to handle multiple languages in database , English, Japanese ,Chinese...  The Oracle database used WE8ISO8859P1 from long time ago , when put "Chinese" letters in database, client application just use  WE8ISO8859P1 (client encoding) to insert data into database , so no conversion there. When client app query the data , it still get the data with WE8ISO8859P1 , and then convert that data to UTF8 from client app side.   Now, we plan migrate to Postgresql 13,  want Postgresql database use WE8ISO8859P1 same as Oracle too, and same handling as Oracle.    Any potential risk ?

Thanks,

James
-----Original Message-----
From: Laurenz Albe <laurenz.albe@cybertec.at> 
Sent: Friday, January 14, 2022 3:20 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>; pgsql-admin@lists.postgresql.org
Subject: Re: database encoding and collation.

On Wed, 2022-01-12 at 13:31 +0000, James Pang (chaolpan) wrote:
We plan to migrate from Oracle to Postgresql 13, existing Oracle 
database charaterset WE8ISO8859P1 , NLS_SORT = BINARY, NLS_COM=BINARY. 
we want to keep same encoding and collation in Postgresql 13 as Oracle, so encoding=LATIN1 (Aliases ISO88591), what’s the suggested collation and ctype? Using ‘C’,’POSIX’, or en_US.LATIN1 ?
if using en_US.LATIN1 any impact to indexes and query where like % ?
The OS is Redhat 8, when export NLS_LANG=en_US.LATIN1, LC_COLLATE and 
LC_CTYPE automatically set to en_US.LATIN1, and then initdb create database clusters with  collname, collation, ctype all = en_US.LATIN1.
I advise against using any encoding other than UTF-8.
What are you worried about?

For the locale, you can get the same behavior as in Oracle by using "C" or "POSIX"
(they are the same).

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com


--
Angular momentum makes the world go 'round.

Re: database encoding and collation.

From
Laurenz Albe
Date:
On Fri, 2022-01-14 at 02:11 +0000, James Pang (chaolpan) wrote:
> This question is about how to handle multiple languages in database , English, Japanese ,Chinese...
> The Oracle database used WE8ISO8859P1 from long time ago , when put "Chinese" letters in database,
> client application just use  WE8ISO8859P1 (client encoding) to insert data into database ,
> so no conversion there. When client app query the data , it still get the data with WE8ISO8859P1 ,
> and then convert that data to UTF8 from client app side.   Now, we plan migrate to Postgresql 13,
> want Postgresql database use WE8ISO8859P1 same as Oracle too, and same handling as Oracle. 
> Any potential risk ?

I see the problem: the sloppy checking of Oracle allowed you to insert corrupted strings into
the Oracle database by setting client encoding equal to server encoding, and you might not
detect the problem until you try to read the data with a different client encoding.

You may be able to carry on like this in PostgreSQL, since all bytes are valid LATIN1
code points, but (like in Oracle) you will run into problems as soon as you access the
database with a different client encoding, for example with a JDBC driver.
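
That failure mode can be sketched in Python (an illustration, not from the thread; the string "中文" is a made-up example value):

```python
# The app encodes text as UTF-8 and sends the raw bytes while
# client_encoding equals server_encoding (LATIN1), so nothing is checked.
text = "中文"
raw = text.encode("utf-8")        # b'\xe4\xb8\xad\xe6\x96\x87'

# Every byte 0x00-0xFF is a valid LATIN1 code point, so the server
# happily treats the bytes as LATIN1 text:
stored = raw.decode("latin-1")

# A LATIN1 client gets the identical bytes back and can decode them itself:
assert stored.encode("latin-1").decode("utf-8") == text

# But a client using a different encoding (e.g. a UTF-8 JDBC connection)
# receives the server's LATIN1-to-UTF-8 conversion of those bytes,
# i.e. mojibake rather than the original text:
assert stored != text
```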

It would be better to fix the problem.  If all the data in the database are actually
encoded in a certain encoding (different from ISO-8859-1), you could simply import them
into a database with that encoding, and you would have the benefit of having correct data.
If different strings are encoded in different encodings, then you would have to repair the
data before doing that.
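
As a sketch of that repair, assuming the stored bytes are really UTF-8 that was mislabeled as LATIN1 (the example value and the helper `looks_like_utf8` are hypothetical, not from the thread):

```python
# A mislabeled string as it would appear in the LATIN1 database:
mislabeled = "中文".encode("utf-8").decode("latin-1")

# Reinterpret the underlying bytes as UTF-8 to recover the real text:
repaired = mislabeled.encode("latin-1").decode("utf-8")
assert repaired == "中文"

# If different rows use different encodings, a strict decode raises
# UnicodeDecodeError, which can flag the rows needing manual repair:
def looks_like_utf8(s: str) -> bool:
    try:
        s.encode("latin-1").decode("utf-8")
        return True
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
```

A genuine LATIN1 string such as "café" fails this check, so the two kinds of rows can be separated before re-importing into a UTF-8 database.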

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com