Thread: locales and encodings Oh MY!

locales and encodings Oh MY!

From
"mark"
Date:
Hi all,

So today while I investigating some slow queries I found out that some time
ago some of our databases (due, innocently I suspect, to upgrades or
hardware migrations) are now not using C as lc_ctype. And why some 'things'
got a lot slower...

To correct this problem it seems like I am going to have to do a dump,
intidb, then restore.  (or add some new indexes and eat the extra overhead).
Unless someone knows something I don't about changing ctype and collation
with some sleight of hand (if you do I am very interested)


I have listed what I think I will be doing with regards to initdb. if anyone
sees problems with the following mixture during my dump -> init-> restore I
would be most keen in hearing about it.



initdb /path/to/data/dir --lc_ctype=C --lc_collation=C
--lc_message=en_US.UTF8 --lc_monetary=en_US.UTF8 --lc_numeric=en_US.UTF8
--lc_time=en_US.UTF8 -E UTF8




Thank you,

Mark


Re: locales and encodings Oh MY!

From
Gabriele Bartolini
Date:
Hi Mark,

Il 12/11/10 03:31, mark ha scritto:
> I have listed what I think I will be doing with regards to initdb. if anyone
> sees problems with the following mixture during my dump ->  init->  restore I
> would be most keen in hearing about it.
FYI, PostgreSQL 8.4 introduced database level collation. Before that,
yes, you need to re-init your data dir.
> initdb /path/to/data/dir --lc_ctype=C --lc_collation=C
> --lc_message=en_US.UTF8 --lc_monetary=en_US.UTF8 --lc_numeric=en_US.UTF8
> --lc_time=en_US.UTF8 -E UTF8

Maybe you meant --lc_collate ?

Cheers,
Gabriele

--
  Gabriele Bartolini - 2ndQuadrant Italia
  PostgreSQL Training, Services and Support
  gabriele.bartolini@2ndQuadrant.it | www.2ndQuadrant.it


Re: locales and encodings Oh MY!

From
mark
Date:
On Fri, Nov 12, 2010 at 12:45 AM, Gabriele Bartolini
<gabriele.bartolini@2ndquadrant.it> wrote:
> Hi Mark,
>
> Il 12/11/10 03:31, mark ha scritto:
>>
>> I have listed what I think I will be doing with regards to initdb. if
>> anyone
>> sees problems with the following mixture during my dump ->  init->
>>  restore I
>> would be most keen in hearing about it.
>
> FYI, PostgreSQL 8.4 introduced database level collation. Before that, yes,
> you need to re-init your data dir.

Guess I missed that, I have 8.3.X boxes in production and 9.0.1 boxes
in dev so I guess only someone of them will require a re-init.

>>
>> initdb /path/to/data/dir --lc_ctype=C --lc_collation=C
>> --lc_message=en_US.UTF8 --lc_monetary=en_US.UTF8 --lc_numeric=en_US.UTF8
>> --lc_time=en_US.UTF8 -E UTF8
>
> Maybe you meant --lc_collate ?
>

Yes I did mean lc_collate - thanks

> Cheers,
> Gabriele
>

With how similar straight C and en_US.UTF8 are it was suggested to me,
by persons who are far more C knowledgeable then I in my office, that
this is something the PG community could "fix" . A "fix" being so that
"col LIKE 'foo%' " could use btree indexes in locales like en_US.UTF8
(and probably some others).


@hackers ->
is the request unreasonable ? anyone got any idea of the price tag to
make that happen ?



> --
>  Gabriele Bartolini - 2ndQuadrant Italia
>  PostgreSQL Training, Services and Support
>  gabriele.bartolini@2ndQuadrant.it | www.2ndQuadrant.it
>
>

Re: locales and encodings Oh MY!

From
"Kevin Grittner"
Date:
mark  wrote:

> A "fix" being so that "col LIKE 'foo%' " could use btree indexes in
> locales like en_US.UTF8 (and probably some others).

How about specifying an opclass?:

http://www.postgresql.org/docs/current/interactive/indexes-opclass.html

-Kevin

Re: [HACKERS] locales and encodings Oh MY!

From
Robert Haas
Date:
> With how similar straight C and en_US.UTF8 are it was suggested to me,
> by persons who are far more C knowledgeable then I in my office, that
> this is something the PG community could "fix" . A "fix" being so that
> "col LIKE 'foo%' " could use btree indexes in locales like en_US.UTF8
> (and probably some others).
>
>
> @hackers ->
> is the request unreasonable ? anyone got any idea of the price tag to
> make that happen ?

I thought it already did that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] locales and encodings Oh MY!

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
>> With how similar straight C and en_US.UTF8 are it was suggested to me,
>> by persons who are far more C knowledgeable then I in my office, that
>> this is something the PG community could "fix" . A "fix" being so that
>> "col LIKE 'foo%' " could use btree indexes in locales like en_US.UTF8
>> (and probably some others).

>> is the request unreasonable ? anyone got any idea of the price tag to
>> make that happen ?

> I thought it already did that.

No, and the odds of it ever happening are insignificant.  The sort order
associated with en_US (and other "dictionary order" locales) is just too
randomly different from what you need to optimize a LIKE search.
(Whoever told you en_US sorts similarly to C is nuts.)

The solution if you want the database's prevailing sort order to be en_US
is to put an extra text_pattern_ops index on the column you want to do
LIKE searches on.  We might eventually have the ability to spell that
"put a C-locale index on the column", but text_pattern_ops is the way to
do it today.

            regards, tom lane

Re: [HACKERS] locales and encodings Oh MY!

From
mark
Date:
On Fri, Nov 12, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>>> With how similar straight C and en_US.UTF8 are it was suggested to me,
>>> by persons who are far more C knowledgeable then I in my office, that
>>> this is something the PG community could "fix" . A "fix" being so that
>>> "col LIKE 'foo%' " could use btree indexes in locales like en_US.UTF8
>>> (and probably some others).
>
>>> is the request unreasonable ? anyone got any idea of the price tag to
>>> make that happen ?
>
>> I thought it already did that.
>
> No, and the odds of it ever happening are insignificant.  The sort order
> associated with en_US (and other "dictionary order" locales) is just too
> randomly different from what you need to optimize a LIKE search.
> (Whoever told you en_US sorts similarly to C is nuts.)
>
> The solution if you want the database's prevailing sort order to be en_US
> is to put an extra text_pattern_ops index on the column you want to do
> LIKE searches on.  We might eventually have the ability to spell that
> "put a C-locale index on the column", but text_pattern_ops is the way to
> do it today.
>
>                        regards, tom lane
>

Ok I hear you loud and clear.    I am going to eat the overhead until
I get to 9.0.1, currently on 8.3.X in some places.

I will either take an outage and do a dump - re-init-restore or
inplace upgrade and then do some locking, copy, drop old, rename new
db path.

thanks all.


..: Mark