Thread: postgres & server encodings

postgres & server encodings

From
"Salem Berhanu"
Date:
What exactly is the SQL_ASCII encoding in postgres? I have a pg installation
with an SQL_ASCII server encoding and my database (also SQL_ASCII) seems to
be able to handle all types of encodings? how is this possible? is this what
the server & db encoding needs to be set to in order to handle various
encodings?
Also is there a way of dumping data from an SQL_ASCII db to a UNICODE db.
Creating a dump and converting with a tool like iconv is probably not an
option since the data in the SQL_ASCII db has unknown/inconsistant encoding.
Please let me know as soon as possible.
Thanks.
Salem
P.S. {I sent a similar e-mail earlier which wasn't posted on the site. ??}


Re: postgres & server encodings

From
Tom Lane
Date:
"Salem Berhanu" <salemb4@hotmail.com> writes:
> What exactly is the SQL_ASCII encoding in postgres?

SQL_ASCII isn't so much an encoding as the declaration that you don't
care about encodings.  That setting simply disables encoding validity
checks and encoding conversions.  The server will take any byte string
clients send it (barring only embedded zero bytes), and store and return
it unchanged.

Since it disables conversions, the notion of converting to another
encoding is pretty much meaningless :-(.

            regards, tom lane

Re: postgres & server encodings

From
Martijn van Oosterhout
Date:
On Mon, Aug 08, 2005 at 04:10:50PM +0000, Salem Berhanu wrote:
> What exactly is the SQL_ASCII encoding in postgres? I have a pg
> installation with an SQL_ASCII server encoding and my database (also
> SQL_ASCII) seems to be able to handle all types of encodings? how is this
> possible? is this what the server & db encoding needs to be set to in order
> to handle various encodings?

SQL_ASCII means that the database does no locale specific or language
specific encoding ever. It won't check what you send it either. If
you're content to let clients deal with any encoding issues, this may
be what you want.

But anything to do with lower(), upper(), case-insenstive in the
database itself will be totally stupid since it's assuming ASCII.

> Also is there a way of dumping data from an SQL_ASCII db to a UNICODE db.
> Creating a dump and converting with a tool like iconv is probably not an
> option since the data in the SQL_ASCII db has unknown/inconsistant encoding.
> Please let me know as soon as possible.

This is a messy situation. Since the system can't guess your encoding
you'd have to fix it all up yourself...

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: postgres & server encodings

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> SQL_ASCII means that the database does no locale specific or language
> specific encoding ever. It won't check what you send it either. If
> you're content to let clients deal with any encoding issues, this may
> be what you want.

> But anything to do with lower(), upper(), case-insenstive in the
> database itself will be totally stupid since it's assuming ASCII.

Actually it's worse than that: case conversion and comparisons inside
the database will still be done using whatever locale setting the
database was initdb'd with.  If that locale is C, then you have just
ASCII-aware case conversions and memcmp-like sorting, but everything
will behave reasonably sanely within those limitations.  If the locale
is not C then it is assuming some specific encoding that *the database
is not enforcing*.  This is bad news and can result in all sorts of
inconsistent behavior.

We really need to figure out some way of enforcing that the database
encoding is OK to use with the locale setting.  Peter put some trial
code for this into initdb in 8.0, but it's only issuing a warning rather
than enforcing the restriction, so I don't have a lot of confidence
that it is right.  (How many people even noticed the warning message?)

            regards, tom lane

Re: postgres & server encodings

From
Lincoln Yeoh
Date:
At 05:59 PM 8/9/2005 +0200, Martijn van Oosterhout wrote:

>SQL_ASCII means that the database does no locale specific or language
>specific encoding ever. It won't check what you send it either. If
>you're content to let clients deal with any encoding issues, this may
>be what you want.
>
>But anything to do with lower(), upper(), case-insenstive in the
>database itself will be totally stupid since it's assuming ASCII.

Is it possible or even good to have the ability to allow you to pick a
particular locale for a query/function?

e.g. select * from messages where locale_code=$locale_code order by
locale_code, multilocale_lower(message,locale);

Or even:

create index lower_keyword_idx on keywords (multilocale_lower(keyword,locale))
(there's a column called locale in both tables)

Does that actually make sense? ;)

I suppose we can do that in the client. But it'll be nicer if we can use
"order by", "group by", and do it for more than one locale at a time.

Can Postgresql currently handle more than one locale within the same
database AND have some useful locale sensitive DB functions?

Regards,
Link.



Re: [ADMIN] postgres & server encodings

From
"Joel Fradkin"
Date:
Not that I am an expert or anything, but my initial data base was SQLASCII
and I did have to convert it to Unicode.
My reasons were we store French characters in our database and the newer
odbc driver was not displaying them correctly coming from SQLASCII, but was
from UNICODE.
I also think that it can affect functions like length and upper, but Tom
knows a ton more then me about this stuff.

I did my initial conversion on 7.4 and the odbc driver at that time had no
issues with SQLASCII displaying the French, but I think in 8.0.1 I started
seeing an issue. The latest version of the driver 8.0.4 seems to be working
well (only up a little over 24 hours thus far).

I wish I had used a unicode data base from the start (7.4 driver was what I
used and it did not like moving from MSSQL to Unicode). I later switched to
.net (npgsql objects) for my conversion and used a encoding object to write
the data correctly.

Joel Fradkin

Wazagua, Inc.
2520 Trailmate Dr
Sarasota, Florida 34243
Tel.  941-753-7111 ext 305

jfradkin@wazagua.com
www.wazagua.com
Powered by Wazagua
Providing you with the latest Web-based technology & advanced tools.
C 2004. WAZAGUA, Inc. All rights reserved. WAZAGUA, Inc
 This email message is for the use of the intended recipient(s) and may
contain confidential and privileged information.  Any unauthorized review,
use, disclosure or distribution is prohibited.  If you are not the intended
recipient, please contact the sender by reply email and delete and destroy
all copies of the original message, including attachments.




-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Tuesday, August 09, 2005 11:59 AM
To: Salem Berhanu
Cc: pgsql-admin@postgresql.org; pgsql-general@postgresql.org
Subject: Re: [ADMIN] [GENERAL] postgres & server encodings

"Salem Berhanu" <salemb4@hotmail.com> writes:
> What exactly is the SQL_ASCII encoding in postgres?

SQL_ASCII isn't so much an encoding as the declaration that you don't
care about encodings.  That setting simply disables encoding validity
checks and encoding conversions.  The server will take any byte string
clients send it (barring only embedded zero bytes), and store and return
it unchanged.

Since it disables conversions, the notion of converting to another
encoding is pretty much meaningless :-(.

            regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] postgres & server encodings

From
Alvaro Herrera
Date:
On Tue, Aug 09, 2005 at 12:56:37PM -0400, Joel Fradkin wrote:
> Not that I am an expert or anything, but my initial data base was SQLASCII
> and I did have to convert it to Unicode.
> My reasons were we store French characters in our database and the newer
> odbc driver was not displaying them correctly coming from SQLASCII, but was
> from UNICODE.
> I also think that it can affect functions like length and upper, but Tom
> knows a ton more then me about this stuff.
>
> I did my initial conversion on 7.4 and the odbc driver at that time had no
> issues with SQLASCII displaying the French, but I think in 8.0.1 I started
> seeing an issue. The latest version of the driver 8.0.4 seems to be working
> well (only up a little over 24 hours thus far).

A conversion will work fine assuming the data is all encoded using the
same encoding.  So if it's all utf8 ("Unicode") already, you can import
it verbatim into a UTF8 database and it will work fine.  If it's all
Latin-1, you can import into a UTF-8 db using a client_encoding=latin1
during import, or verbatim to a Latin-1 database, and it will also work
fine.  (You of course are expected to be able to figure out what
encoding is the data really in.)

The problem only shows up when you have mixed data -- say, you have two
applications, one website in PHP which inserts data in Latin-1, and a
Windows app which inserts in UTF-8.  In this case your data will be a
mess to fix, and there's no way a single conversion will get it right.
You will have to manually separate the parts that are UTF8 from the
Latin1, and import them separately.  Not a position I'd like to be in.

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"Coge la flor que hoy nace alegre, ufana. ¿Quién sabe si nacera otra mañana?"

Re: [ADMIN] postgres & server encodings

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> The problem only shows up when you have mixed data -- say, you have two
> applications, one website in PHP which inserts data in Latin-1, and a
> Windows app which inserts in UTF-8.  In this case your data will be a
> mess to fix, and there's no way a single conversion will get it right.
> You will have to manually separate the parts that are UTF8 from the
> Latin1, and import them separately.  Not a position I'd like to be in.

The only helpful tip I can think of is that you can try to import data
into a UTF8 database and see if it gets rejected as badly encoded; this
will at least give you a weak tool to separate what's what.

I'm afraid the reverse direction won't help much --- in single-byte
encodings such as Latin1 there are no encoding errors, and so you can't
do any simple filtering to check in that direction.  In the end you're
going to have to eyeball a lot of data for plausibility :-(

            regards, tom lane

Re: postgres & server encodings

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Salem Berhanu" <salemb4@hotmail.com> writes:
> > What exactly is the SQL_ASCII encoding in postgres?
>
> SQL_ASCII isn't so much an encoding as the declaration that you don't
> care about encodings.

It's too late to consider renaming this SQL_RAW or something like that is it?
It is a huge source of confusion.

Perhaps have a separate "ascii" encoding that checks and complains if any
non-ascii characters are present.

--
greg