Thread: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Aleksey Tsalolikhin

Date:

29 August 2012, 17:32:07

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> citext unfortunately doesn't allow for index optimization of LIKE
> queries, which IMNSHO defeats the whole purpose.  to the best way
> remains to use lower() ...
> this will be index optimized and fast as long as you specified C
> locale for your database.

What is the difference between C and en_US.UTF8, please?  We see that
the same query (that invokes a sort) runs 15% faster under the C
locale.  The output between C and en_US.UTF8 is identical.  We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Best,
Aleksey

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Bruce Momjian

Date:

29 August 2012, 17:43:39

On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> > citext unfortunately doesn't allow for index optimization of LIKE
> > queries, which IMNSHO defeats the whole purpose.  to the best way
> > remains to use lower() ...
> > this will be index optimized and fast as long as you specified C
> > locale for your database.
>
> What is the difference between C and en_US.UTF8, please?  We see that
> the same query (that invokes a sort) runs 15% faster under the C
> locale.  The output between C and en_US.UTF8 is identical.  We're
> considering moving our database from en_US.UTF8 to C, but we do deal
> with internationalized text.

Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters.  The single-byte
encodings have somewhat less overhead than UTF8.  You can try using C
locales for databases that don't require non-ASCII characters.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Merlin Moncure

Date:

29 August 2012, 18:45:25

On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
>> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> > citext unfortunately doesn't allow for index optimization of LIKE
>> > queries, which IMNSHO defeats the whole purpose.  to the best way
>> > remains to use lower() ...
>> > this will be index optimized and fast as long as you specified C
>> > locale for your database.
>>
>> What is the difference between C and en_US.UTF8, please?  We see that
>> the same query (that invokes a sort) runs 15% faster under the C
>> locale.  The output between C and en_US.UTF8 is identical.  We're
>> considering moving our database from en_US.UTF8 to C, but we do deal
>> with internationalized text.
>
> Well, C has reduced overhead for string comparisons, but obviously
> doesn't work well for international characters.  The single-byte
> encodings have somewhat less overhead than UTF8.  You can try using C
> locales for databases that don't require non-ASCII characters.

To add:
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale.  This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column.  This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.

merlin

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Scott Marlowe

Date:

29 August 2012, 18:52:54

On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
>> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> > citext unfortunately doesn't allow for index optimization of LIKE
>> > queries, which IMNSHO defeats the whole purpose.  to the best way
>> > remains to use lower() ...
>> > this will be index optimized and fast as long as you specified C
>> > locale for your database.
>>
>> What is the difference between C and en_US.UTF8, please?  We see that
>> the same query (that invokes a sort) runs 15% faster under the C
>> locale.  The output between C and en_US.UTF8 is identical.  We're
>> considering moving our database from en_US.UTF8 to C, but we do deal
>> with internationalized text.
>
> Well, C has reduced overhead for string comparisons, but obviously
> doesn't work well for international characters.  The single-byte
> encodings have somewhat less overhead than UTF8.  You can try using C
> locales for databases that don't require non-ASCII characters.

I think you're confusing encodings with locales.  C is a locale. You
can have a database with a locale of C and UTF-8 encoding.

create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;

\l
     Name     |  Owner   | Encoding  |   Collate   |    Ctype    |
Access privileges
--------------+----------+-----------+-------------+-------------+-----------------------
 clocale_utf8 | smarlowe | UTF8      | C           | en_US.UTF-8 |


SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Dmitriy Igrishin

Date:

29 August 2012, 19:14:11

2012/8/29 Merlin Moncure <mmoncure@gmail.com>

On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
>> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> > citext unfortunately doesn't allow for index optimization of LIKE
>> > queries, which IMNSHO defeats the whole purpose. to the best way
>> > remains to use lower() ...
>> > this will be index optimized and fast as long as you specified C
>> > locale for your database.
>>
>> What is the difference between C and en_US.UTF8, please? We see that
>> the same query (that invokes a sort) runs 15% faster under the C
>> locale. The output between C and en_US.UTF8 is identical. We're
>> considering moving our database from en_US.UTF8 to C, but we do deal
>> with internationalized text.
>
> Well, C has reduced overhead for string comparisons, but obviously
> doesn't work well for international characters. The single-byte
> encodings have somewhat less overhead than UTF8. You can try using C
> locales for databases that don't require non-ASCII characters.

To add:
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale. This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column. This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.

Good point! Thanks!

--
// Dmitriy.

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Bruce Momjian

Date:

29 August 2012, 20:16:00

On Wed, Aug 29, 2012 at 01:45:20PM -0500, Merlin Moncure wrote:
> On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
> >> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> >> > citext unfortunately doesn't allow for index optimization of LIKE
> >> > queries, which IMNSHO defeats the whole purpose.  to the best way
> >> > remains to use lower() ...
> >> > this will be index optimized and fast as long as you specified C
> >> > locale for your database.
> >>
> >> What is the difference between C and en_US.UTF8, please?  We see that
> >> the same query (that invokes a sort) runs 15% faster under the C
> >> locale.  The output between C and en_US.UTF8 is identical.  We're
> >> considering moving our database from en_US.UTF8 to C, but we do deal
> >> with internationalized text.
> >
> > Well, C has reduced overhead for string comparisons, but obviously
> > doesn't work well for international characters.  The single-byte
> > encodings have somewhat less overhead than UTF8.  You can try using C
> > locales for databases that don't require non-ASCII characters.
>
> To add:
> The middle ground I usually choose is to have a database encoding of
> UTF8 but with the C (aka POSIX) locale.  This gives you the ability to
> store any unicode but indexing operations will use the faster C string
> comparison operations for a significant performance boost --
> especially for partial string searches on an indexed column.  This is
> an even more attractive option in 9.1 with the ability to specify
> specific collations at runtime.

Do you get proper sort ordering in this case, or only when you specific
the proper collation at runtime?
--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Bruce Momjian

Date:

29 August 2012, 20:17:14

On Wed, Aug 29, 2012 at 12:52:50PM -0600, Scott Marlowe wrote:
> On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
> >> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> >> > citext unfortunately doesn't allow for index optimization of LIKE
> >> > queries, which IMNSHO defeats the whole purpose.  to the best way
> >> > remains to use lower() ...
> >> > this will be index optimized and fast as long as you specified C
> >> > locale for your database.
> >>
> >> What is the difference between C and en_US.UTF8, please?  We see that
> >> the same query (that invokes a sort) runs 15% faster under the C
> >> locale.  The output between C and en_US.UTF8 is identical.  We're
> >> considering moving our database from en_US.UTF8 to C, but we do deal
> >> with internationalized text.
> >
> > Well, C has reduced overhead for string comparisons, but obviously
> > doesn't work well for international characters.  The single-byte
> > encodings have somewhat less overhead than UTF8.  You can try using C
> > locales for databases that don't require non-ASCII characters.
>
> I think you're confusing encodings with locales.  C is a locale. You

I think technically C is a non-locale.

> can have a database with a locale of C and UTF-8 encoding.
>
> create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;
>
> \l
>      Name     |  Owner   | Encoding  |   Collate   |    Ctype    |
> Access privileges
> --------------+----------+-----------+-------------+-------------+-----------------------
>  clocale_utf8 | smarlowe | UTF8      | C           | en_US.UTF-8 |
>
>
> SQL_ASCII is the encoding equivalent of C locale, but it also allows
> multi-byte characters.

Yes, but what sort ordering do you get in that case?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Scott Marlowe

Date:

29 August 2012, 20:32:38

On Wed, Aug 29, 2012 at 2:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Aug 29, 2012 at 12:52:50PM -0600, Scott Marlowe wrote:
>> On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> > On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
>> >> On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> >> > citext unfortunately doesn't allow for index optimization of LIKE
>> >> > queries, which IMNSHO defeats the whole purpose.  to the best way
>> >> > remains to use lower() ...
>> >> > this will be index optimized and fast as long as you specified C
>> >> > locale for your database.
>> >>
>> >> What is the difference between C and en_US.UTF8, please?  We see that
>> >> the same query (that invokes a sort) runs 15% faster under the C
>> >> locale.  The output between C and en_US.UTF8 is identical.  We're
>> >> considering moving our database from en_US.UTF8 to C, but we do deal
>> >> with internationalized text.
>> >
>> > Well, C has reduced overhead for string comparisons, but obviously
>> > doesn't work well for international characters.  The single-byte
>> > encodings have somewhat less overhead than UTF8.  You can try using C
>> > locales for databases that don't require non-ASCII characters.
>>
>> I think you're confusing encodings with locales.  C is a locale. You
>
> I think technically C is a non-locale.

True.  But it's NOT an encoding.

>> can have a database with a locale of C and UTF-8 encoding.
>>
>> create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;
>>
>> \l
>>      Name     |  Owner   | Encoding  |   Collate   |    Ctype    |
>> Access privileges
>> --------------+----------+-----------+-------------+-------------+-----------------------
>>  clocale_utf8 | smarlowe | UTF8      | C           | en_US.UTF-8 |
>>
>>
>> SQL_ASCII is the encoding equivalent of C locale, but it also allows
>> multi-byte characters.
>
> Yes, but what sort ordering do you get in that case?

Byte ordering.

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Jasen Betts

Date:

02 September 2012, 07:35:08

On 2012-08-29, Bruce Momjian <bruce@momjian.us> wrote:

>>      Name     |  Owner   | Encoding  |   Collate   |    Ctype    |
>> Access privileges
>> --------------+----------+-----------+-------------+-------------+-----------------------
>>  clocale_utf8 | smarlowe | UTF8      | C           | en_US.UTF-8 |
>>
>>
>> SQL_ASCII is the encoding equivalent of C locale, but it also allows
>> multi-byte characters.
>
> Yes, but what sort ordering do you get in that case?
>

C ordering is by unicode code point number.

c=# values ('a'),('Z'),('€'),('z'),('¢'),('ä'),('Ā') order by 1;
 column1
---------
 Z
 a
 z
 ¢
 ä
 Ā
 €

--
⚂⚃ 100% natural

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Peter Eisentraut

Date:

03 September 2012, 05:08:34

On Wed, 2012-08-29 at 10:31 -0700, Aleksey Tsalolikhin wrote:
> What is the difference between C and en_US.UTF8, please?

There are many differences, but here is a simple one:

$ (echo a; echo A; echo b; echo B) | LC_ALL=C sort
A
B
a
b
$ (echo a; echo A; echo b; echo B) | LC_ALL=en_US.utf8 sort
a
A
b
B

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

From

Aleksey Tsalolikhin

Date:

21 September 2012, 21:26:54

On Sun, Sep 2, 2012 at 10:08 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Wed, 2012-08-29 at 10:31 -0700, Aleksey Tsalolikhin wrote:
>> What is the difference between C and en_US.UTF8, please?
>
> There are many differences, but here is a simple one:
>
> $ (echo a; echo A; echo b; echo B) | LC_ALL=C sort
> ...
> $ (echo a; echo A; echo b; echo B) | LC_ALL=en_US.utf8 sort
> ...

Beautifully illustrated, and in a language I readily understand as a UNIX
sys admin.  Thank you, Peter!

Thank you, Bruce, Merlin, and everybody else who replied.

I love the flexibility 9.1 allows to have a default sort locale with ability
to alter locale per-query if needed.  It gives us the confidence to move forward
with switching to C for sorting -- right now we only need to sort on US
English data, so this is a perfect fit.  When we internationalize, 9.1
allows us to switch back to UTF8-based sorting if needed.  I love it!

Thank you!
Aleksey