Thread: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > citext unfortunately doesn't allow for index optimization of LIKE > queries, which IMNSHO defeats the whole purpose. to the best way > remains to use lower() ... > this will be index optimized and fast as long as you specified C > locale for your database. What is the difference between C and en_US.UTF8, please? We see that the same query (that invokes a sort) runs 15% faster under the C locale. The output between C and en_US.UTF8 is identical. We're considering moving our database from en_US.UTF8 to C, but we do deal with internationalized text. Best, Aleksey
On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote: > On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > > citext unfortunately doesn't allow for index optimization of LIKE > > queries, which IMNSHO defeats the whole purpose. to the best way > > remains to use lower() ... > > this will be index optimized and fast as long as you specified C > > locale for your database. > > What is the difference between C and en_US.UTF8, please? We see that > the same query (that invokes a sort) runs 15% faster under the C > locale. The output between C and en_US.UTF8 is identical. We're > considering moving our database from en_US.UTF8 to C, but we do deal > with internationalized text. Well, C has reduced overhead for string comparisons, but obviously doesn't work well for international characters. The single-byte encodings have somewhat less overhead than UTF8. You can try using C locales for databases that don't require non-ASCII characters. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote: >> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> > citext unfortunately doesn't allow for index optimization of LIKE >> > queries, which IMNSHO defeats the whole purpose. to the best way >> > remains to use lower() ... >> > this will be index optimized and fast as long as you specified C >> > locale for your database. >> >> What is the difference between C and en_US.UTF8, please? We see that >> the same query (that invokes a sort) runs 15% faster under the C >> locale. The output between C and en_US.UTF8 is identical. We're >> considering moving our database from en_US.UTF8 to C, but we do deal >> with internationalized text. > > Well, C has reduced overhead for string comparisons, but obviously > doesn't work well for international characters. The single-byte > encodings have somewhat less overhead than UTF8. You can try using C > locales for databases that don't require non-ASCII characters. To add: The middle ground I usually choose is to have a database encoding of UTF8 but with the C (aka POSIX) locale. This gives you the ability to store any unicode but indexing operations will use the faster C string comparison operations for a significant performance boost -- especially for partial string searches on an indexed column. This is an even more attractive option in 9.1 with the ability to specify specific collations at runtime. merlin
On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote: >> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> > citext unfortunately doesn't allow for index optimization of LIKE >> > queries, which IMNSHO defeats the whole purpose. to the best way >> > remains to use lower() ... >> > this will be index optimized and fast as long as you specified C >> > locale for your database. >> >> What is the difference between C and en_US.UTF8, please? We see that >> the same query (that invokes a sort) runs 15% faster under the C >> locale. The output between C and en_US.UTF8 is identical. We're >> considering moving our database from en_US.UTF8 to C, but we do deal >> with internationalized text. > > Well, C has reduced overhead for string comparisons, but obviously > doesn't work well for international characters. The single-byte > encodings have somewhat less overhead than UTF8. You can try using C > locales for databases that don't require non-ASCII characters. I think you're confusing encodings with locales. C is a locale. You can have a database with a locale of C and UTF-8 encoding. create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0; \l Name | Owner | Encoding | Collate | Ctype | Access privileges --------------+----------+-----------+-------------+-------------+----------------------- clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 | SQL_ASCII is the encoding equivalent of C locale, but it also allows multi-byte characters.
2012/8/29 Merlin Moncure <mmoncure@gmail.com>
On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:To add:
> On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
>> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> > citext unfortunately doesn't allow for index optimization of LIKE
>> > queries, which IMNSHO defeats the whole purpose. to the best way
>> > remains to use lower() ...
>> > this will be index optimized and fast as long as you specified C
>> > locale for your database.
>>
>> What is the difference between C and en_US.UTF8, please? We see that
>> the same query (that invokes a sort) runs 15% faster under the C
>> locale. The output between C and en_US.UTF8 is identical. We're
>> considering moving our database from en_US.UTF8 to C, but we do deal
>> with internationalized text.
>
> Well, C has reduced overhead for string comparisons, but obviously
> doesn't work well for international characters. The single-byte
> encodings have somewhat less overhead than UTF8. You can try using C
> locales for databases that don't require non-ASCII characters.
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale. This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column. This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.
Good point! Thanks!
--
// Dmitriy.
On Wed, Aug 29, 2012 at 01:45:20PM -0500, Merlin Moncure wrote: > On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote: > > On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote: > >> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > >> > citext unfortunately doesn't allow for index optimization of LIKE > >> > queries, which IMNSHO defeats the whole purpose. to the best way > >> > remains to use lower() ... > >> > this will be index optimized and fast as long as you specified C > >> > locale for your database. > >> > >> What is the difference between C and en_US.UTF8, please? We see that > >> the same query (that invokes a sort) runs 15% faster under the C > >> locale. The output between C and en_US.UTF8 is identical. We're > >> considering moving our database from en_US.UTF8 to C, but we do deal > >> with internationalized text. > > > > Well, C has reduced overhead for string comparisons, but obviously > > doesn't work well for international characters. The single-byte > > encodings have somewhat less overhead than UTF8. You can try using C > > locales for databases that don't require non-ASCII characters. > > To add: > The middle ground I usually choose is to have a database encoding of > UTF8 but with the C (aka POSIX) locale. This gives you the ability to > store any unicode but indexing operations will use the faster C string > comparison operations for a significant performance boost -- > especially for partial string searches on an indexed column. This is > an even more attractive option in 9.1 with the ability to specify > specific collations at runtime. Do you get proper sort ordering in this case, or only when you specific the proper collation at runtime? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Aug 29, 2012 at 12:52:50PM -0600, Scott Marlowe wrote: > On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote: > > On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote: > >> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > >> > citext unfortunately doesn't allow for index optimization of LIKE > >> > queries, which IMNSHO defeats the whole purpose. to the best way > >> > remains to use lower() ... > >> > this will be index optimized and fast as long as you specified C > >> > locale for your database. > >> > >> What is the difference between C and en_US.UTF8, please? We see that > >> the same query (that invokes a sort) runs 15% faster under the C > >> locale. The output between C and en_US.UTF8 is identical. We're > >> considering moving our database from en_US.UTF8 to C, but we do deal > >> with internationalized text. > > > > Well, C has reduced overhead for string comparisons, but obviously > > doesn't work well for international characters. The single-byte > > encodings have somewhat less overhead than UTF8. You can try using C > > locales for databases that don't require non-ASCII characters. > > I think you're confusing encodings with locales. C is a locale. You I think technically C is a non-locale. > can have a database with a locale of C and UTF-8 encoding. > > create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0; > > \l > Name | Owner | Encoding | Collate | Ctype | > Access privileges > --------------+----------+-----------+-------------+-------------+----------------------- > clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 | > > > SQL_ASCII is the encoding equivalent of C locale, but it also allows > multi-byte characters. Yes, but what sort ordering do you get in that case? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Aug 29, 2012 at 2:17 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Aug 29, 2012 at 12:52:50PM -0600, Scott Marlowe wrote: >> On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote: >> > On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote: >> >> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> >> > citext unfortunately doesn't allow for index optimization of LIKE >> >> > queries, which IMNSHO defeats the whole purpose. to the best way >> >> > remains to use lower() ... >> >> > this will be index optimized and fast as long as you specified C >> >> > locale for your database. >> >> >> >> What is the difference between C and en_US.UTF8, please? We see that >> >> the same query (that invokes a sort) runs 15% faster under the C >> >> locale. The output between C and en_US.UTF8 is identical. We're >> >> considering moving our database from en_US.UTF8 to C, but we do deal >> >> with internationalized text. >> > >> > Well, C has reduced overhead for string comparisons, but obviously >> > doesn't work well for international characters. The single-byte >> > encodings have somewhat less overhead than UTF8. You can try using C >> > locales for databases that don't require non-ASCII characters. >> >> I think you're confusing encodings with locales. C is a locale. You > > I think technically C is a non-locale. True. But it's NOT an encoding. >> can have a database with a locale of C and UTF-8 encoding. >> >> create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0; >> >> \l >> Name | Owner | Encoding | Collate | Ctype | >> Access privileges >> --------------+----------+-----------+-------------+-------------+----------------------- >> clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 | >> >> >> SQL_ASCII is the encoding equivalent of C locale, but it also allows >> multi-byte characters. > > Yes, but what sort ordering do you get in that case? Byte ordering.
On 2012-08-29, Bruce Momjian <bruce@momjian.us> wrote: >> Name | Owner | Encoding | Collate | Ctype | >> Access privileges >> --------------+----------+-----------+-------------+-------------+----------------------- >> clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 | >> >> >> SQL_ASCII is the encoding equivalent of C locale, but it also allows >> multi-byte characters. > > Yes, but what sort ordering do you get in that case? > C ordering is by unicode code point number. c=# values ('a'),('Z'),('€'),('z'),('¢'),('ä'),('Ā') order by 1; column1 --------- Z a z ¢ ä Ā € -- ⚂⚃ 100% natural
On Wed, 2012-08-29 at 10:31 -0700, Aleksey Tsalolikhin wrote: > What is the difference between C and en_US.UTF8, please? There are many differences, but here is a simple one: $ (echo a; echo A; echo b; echo B) | LC_ALL=C sort A B a b $ (echo a; echo A; echo b; echo B) | LC_ALL=en_US.utf8 sort a A b B
Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)
From
Aleksey Tsalolikhin
Date:
On Sun, Sep 2, 2012 at 10:08 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On Wed, 2012-08-29 at 10:31 -0700, Aleksey Tsalolikhin wrote: >> What is the difference between C and en_US.UTF8, please? > > There are many differences, but here is a simple one: > > $ (echo a; echo A; echo b; echo B) | LC_ALL=C sort > ... > $ (echo a; echo A; echo b; echo B) | LC_ALL=en_US.utf8 sort > ... Beautifully illustrated, and in a language I readily understand as a UNIX sys admin. Thank you, Peter! Thank you, Bruce, Merlin, and everybody else who replied. I love the flexibility 9.1 allows to have a default sort locale with ability to alter locale per-query if needed. It gives us the confidence to move forward with switching to C for sorting -- right now we only need to sort on US English data, so this is a perfect fit. When we internationalize, 9.1 allows us to switch back to UTF8-based sorting if needed. I love it! Thank you! Aleksey