Thread: postgres & server encodings
What exactly is the SQL_ASCII encoding in postgres? I have a pg installation with an SQL_ASCII server encoding and my database (also SQL_ASCII) seems to be able to handle all types of encodings? how is this possible? is this what the server & db encoding needs to be set to in order to handle various encodings? Also is there a way of dumping data from an SQL_ASCII db to a UNICODE db. Creating a dump and converting with a tool like iconv is probably not an option since the data in the SQL_ASCII db has unknown/inconsistant encoding. Please let me know as soon as possible. Thanks. Salem P.S. {I sent a similar e-mail earlier which wasn't posted on the site. ??}
"Salem Berhanu" <salemb4@hotmail.com> writes: > What exactly is the SQL_ASCII encoding in postgres? SQL_ASCII isn't so much an encoding as the declaration that you don't care about encodings. That setting simply disables encoding validity checks and encoding conversions. The server will take any byte string clients send it (barring only embedded zero bytes), and store and return it unchanged. Since it disables conversions, the notion of converting to another encoding is pretty much meaningless :-(. regards, tom lane
On Mon, Aug 08, 2005 at 04:10:50PM +0000, Salem Berhanu wrote: > What exactly is the SQL_ASCII encoding in postgres? I have a pg > installation with an SQL_ASCII server encoding and my database (also > SQL_ASCII) seems to be able to handle all types of encodings? how is this > possible? is this what the server & db encoding needs to be set to in order > to handle various encodings? SQL_ASCII means that the database does no locale specific or language specific encoding ever. It won't check what you send it either. If you're content to let clients deal with any encoding issues, this may be what you want. But anything to do with lower(), upper(), case-insenstive in the database itself will be totally stupid since it's assuming ASCII. > Also is there a way of dumping data from an SQL_ASCII db to a UNICODE db. > Creating a dump and converting with a tool like iconv is probably not an > option since the data in the SQL_ASCII db has unknown/inconsistant encoding. > Please let me know as soon as possible. This is a messy situation. Since the system can't guess your encoding you'd have to fix it all up yourself... Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Martijn van Oosterhout <kleptog@svana.org> writes: > SQL_ASCII means that the database does no locale specific or language > specific encoding ever. It won't check what you send it either. If > you're content to let clients deal with any encoding issues, this may > be what you want. > But anything to do with lower(), upper(), case-insenstive in the > database itself will be totally stupid since it's assuming ASCII. Actually it's worse than that: case conversion and comparisons inside the database will still be done using whatever locale setting the database was initdb'd with. If that locale is C, then you have just ASCII-aware case conversions and memcmp-like sorting, but everything will behave reasonably sanely within those limitations. If the locale is not C then it is assuming some specific encoding that *the database is not enforcing*. This is bad news and can result in all sorts of inconsistent behavior. We really need to figure out some way of enforcing that the database encoding is OK to use with the locale setting. Peter put some trial code for this into initdb in 8.0, but it's only issuing a warning rather than enforcing the restriction, so I don't have a lot of confidence that it is right. (How many people even noticed the warning message?) regards, tom lane
At 05:59 PM 8/9/2005 +0200, Martijn van Oosterhout wrote: >SQL_ASCII means that the database does no locale specific or language >specific encoding ever. It won't check what you send it either. If >you're content to let clients deal with any encoding issues, this may >be what you want. > >But anything to do with lower(), upper(), case-insenstive in the >database itself will be totally stupid since it's assuming ASCII. Is it possible or even good to have the ability to allow you to pick a particular locale for a query/function? e.g. select * from messages where locale_code=$locale_code order by locale_code, multilocale_lower(message,locale); Or even: create index lower_keyword_idx on keywords (multilocale_lower(keyword,locale)) (there's a column called locale in both tables) Does that actually make sense? ;) I suppose we can do that in the client. But it'll be nicer if we can use "order by", "group by", and do it for more than one locale at a time. Can Postgresql currently handle more than one locale within the same database AND have some useful locale sensitive DB functions? Regards, Link.
Not that I am an expert or anything, but my initial data base was SQLASCII and I did have to convert it to Unicode. My reasons were we store French characters in our database and the newer odbc driver was not displaying them correctly coming from SQLASCII, but was from UNICODE. I also think that it can affect functions like length and upper, but Tom knows a ton more then me about this stuff. I did my initial conversion on 7.4 and the odbc driver at that time had no issues with SQLASCII displaying the French, but I think in 8.0.1 I started seeing an issue. The latest version of the driver 8.0.4 seems to be working well (only up a little over 24 hours thus far). I wish I had used a unicode data base from the start (7.4 driver was what I used and it did not like moving from MSSQL to Unicode). I later switched to .net (npgsql objects) for my conversion and used a encoding object to write the data correctly. Joel Fradkin Wazagua, Inc. 2520 Trailmate Dr Sarasota, Florida 34243 Tel. 941-753-7111 ext 305 jfradkin@wazagua.com www.wazagua.com Powered by Wazagua Providing you with the latest Web-based technology & advanced tools. C 2004. WAZAGUA, Inc. All rights reserved. WAZAGUA, Inc This email message is for the use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and delete and destroy all copies of the original message, including attachments. -----Original Message----- From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Tom Lane Sent: Tuesday, August 09, 2005 11:59 AM To: Salem Berhanu Cc: pgsql-admin@postgresql.org; pgsql-general@postgresql.org Subject: Re: [ADMIN] [GENERAL] postgres & server encodings "Salem Berhanu" <salemb4@hotmail.com> writes: > What exactly is the SQL_ASCII encoding in postgres? SQL_ASCII isn't so much an encoding as the declaration that you don't care about encodings. That setting simply disables encoding validity checks and encoding conversions. The server will take any byte string clients send it (barring only embedded zero bytes), and store and return it unchanged. Since it disables conversions, the notion of converting to another encoding is pretty much meaningless :-(. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
On Tue, Aug 09, 2005 at 12:56:37PM -0400, Joel Fradkin wrote: > Not that I am an expert or anything, but my initial data base was SQLASCII > and I did have to convert it to Unicode. > My reasons were we store French characters in our database and the newer > odbc driver was not displaying them correctly coming from SQLASCII, but was > from UNICODE. > I also think that it can affect functions like length and upper, but Tom > knows a ton more then me about this stuff. > > I did my initial conversion on 7.4 and the odbc driver at that time had no > issues with SQLASCII displaying the French, but I think in 8.0.1 I started > seeing an issue. The latest version of the driver 8.0.4 seems to be working > well (only up a little over 24 hours thus far). A conversion will work fine assuming the data is all encoded using the same encoding. So if it's all utf8 ("Unicode") already, you can import it verbatim into a UTF8 database and it will work fine. If it's all Latin-1, you can import into a UTF-8 db using a client_encoding=latin1 during import, or verbatim to a Latin-1 database, and it will also work fine. (You of course are expected to be able to figure out what encoding is the data really in.) The problem only shows up when you have mixed data -- say, you have two applications, one website in PHP which inserts data in Latin-1, and a Windows app which inserts in UTF-8. In this case your data will be a mess to fix, and there's no way a single conversion will get it right. You will have to manually separate the parts that are UTF8 from the Latin1, and import them separately. Not a position I'd like to be in. -- Alvaro Herrera (<alvherre[a]alvh.no-ip.org>) "Coge la flor que hoy nace alegre, ufana. ¿Quién sabe si nacera otra mañana?"
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > The problem only shows up when you have mixed data -- say, you have two > applications, one website in PHP which inserts data in Latin-1, and a > Windows app which inserts in UTF-8. In this case your data will be a > mess to fix, and there's no way a single conversion will get it right. > You will have to manually separate the parts that are UTF8 from the > Latin1, and import them separately. Not a position I'd like to be in. The only helpful tip I can think of is that you can try to import data into a UTF8 database and see if it gets rejected as badly encoded; this will at least give you a weak tool to separate what's what. I'm afraid the reverse direction won't help much --- in single-byte encodings such as Latin1 there are no encoding errors, and so you can't do any simple filtering to check in that direction. In the end you're going to have to eyeball a lot of data for plausibility :-( regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > "Salem Berhanu" <salemb4@hotmail.com> writes: > > What exactly is the SQL_ASCII encoding in postgres? > > SQL_ASCII isn't so much an encoding as the declaration that you don't > care about encodings. It's too late to consider renaming this SQL_RAW or something like that is it? It is a huge source of confusion. Perhaps have a separate "ascii" encoding that checks and complains if any non-ascii characters are present. -- greg