Thread: Doc patch needed: encodings?

Doc patch needed: encodings?

From
Josh Berkus
Date:
Folks,

I've noticed in recent testing that current versions of both RedHat and SuSE
default to encodings of "en_US.UTF-8" in the US.   Presumably they do
something corresponding in other countries.

I think this means that in the basic install-from-source instructions, we need
to warn people to compile with --locale='C'.   I know it caught me by
surprise when indexing for LIKE stopped working, and I wasted a lot of time
hunting for a PG bug when it was my SuSE upgrade at fault.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Doc patch needed: encodings?

From
Peter Eisentraut
Date:
Josh Berkus wrote:
> I've noticed in recent testing that current versions of both RedHat
> and SuSE default to encodings of "en_US.UTF-8" in the US.
> Presumably they do something corresponding in other countries.
>
> I think this means that in the basic install-from-source
> instructions, we need to warn people to compile with --locale='C'.

I think not.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Doc patch needed: encodings?

From
Josh Berkus
Date:
Peter,

> > I think this means that in the basic install-from-source
> > instructions, we need to warn people to compile with --locale='C'.  
>
> I think not.

Care to explain that?   Why shouldn't we warn people?

Otherwise we're set for 12,000 e-mails to the SQL list on "why doesn't this
query use and index, it did on my old machine".

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Doc patch needed: encodings?

From
Peter Eisentraut
Date:
Josh Berkus wrote:
> Otherwise we're set for 12,000 e-mails to the SQL list on "why
> doesn't this query use and index, it did on my old machine".

Linux distributions have been shipping with non-C locale settings for a
long, long time, so that complaint would be invalid.  And even back
then we figured that recommending that people turn locale off was not
an acceptable "solution".

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Doc patch needed: encodings?

From
Josh Berkus
Date:
Peter,

> Linux distributions have been shipping with non-C locale settings for a
> long, long time, so that complaint would be invalid.

In US distributions it's a recent thing.   The switch to non-C locales is a
recent thing; RH Enterprise 3.0, and SuSE 9.0.  But I expect in Europe you've
been using non-C locales for a while.

I'd like to have an explanation of this somewhere else newbies are liable to
read it, *before* their first production "LIKE" query doesn't use an index.
Where would be appropriate?

And, for English speakers, what exactly is wrong with using 'C' locale instead
of the environment one?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Doc patch needed: encodings?

From
Robert Treat
Date:
On Monday 06 December 2004 13:39, Josh Berkus wrote:
> I'd like to have an explanation of this somewhere else newbies are liable
> to read it, *before* their first production "LIKE" query doesn't use an
> index. Where would be appropriate?
>

It's buried in the faq -> http://www.postgresql.org/docs/faqs/FAQ.html#4.8

--
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

Re: Doc patch needed: encodings?

From
Peter Eisentraut
Date:
Josh Berkus wrote:
> In US distributions it's a recent thing.   The switch to non-C
> locales is a recent thing; RH Enterprise 3.0, and SuSE 9.0.

I have it on record that Red Hat has set a non-C locale by default at
least since Red Hat 6.1 as distributed in North America (aren't they
the same anyway?) in 1999.  I know that because we had this exact
discussion back then.

> I'd like to have an explanation of this somewhere else newbies are
> liable to read it, *before* their first production "LIKE" query
> doesn't use an index. Where would be appropriate?

Near the documentation of "LIKE".

> And, for English speakers, what exactly is wrong with using 'C'
> locale instead of the environment one?

It makes it difficult to write a résumé, to name one thing.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Doc patch needed: encodings?

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Josh Berkus wrote:
>> I'd like to have an explanation of this somewhere else newbies are
>> liable to read it, *before* their first production "LIKE" query
>> doesn't use an index. Where would be appropriate?

> Near the documentation of "LIKE".

I think it would be fair to mention this somewhere near the discussion
of creating a database cluster, too.  The existing documentation does
warn you that sort order may be affected by your choice, but there is
nothing anywhere near that section to suggest that LIKE performance
might be affected.  A para in the "Locale Support" section (in
charset.sgml) would probably be appropriate, and maybe another word or
two in the place that link to it in runtime.sgml and ref/initdb.sgml.

            regards, tom lane

Re: Doc patch needed: encodings?

From
Bruce Momjian
Date:
I have applied the following patch to mention non-C locales affect LIKE.

---------------------------------------------------------------------------

Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Josh Berkus wrote:
> >> I'd like to have an explanation of this somewhere else newbies are
> >> liable to read it, *before* their first production "LIKE" query
> >> doesn't use an index. Where would be appropriate?
>
> > Near the documentation of "LIKE".
>
> I think it would be fair to mention this somewhere near the discussion
> of creating a database cluster, too.  The existing documentation does
> warn you that sort order may be affected by your choice, but there is
> nothing anywhere near that section to suggest that LIKE performance
> might be affected.  A para in the "Locale Support" section (in
> charset.sgml) would probably be appropriate, and maybe another word or
> two in the place that link to it in runtime.sgml and ref/initdb.sgml.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Index: doc/src/sgml/charset.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/charset.sgml,v
retrieving revision 2.47
diff -c -c -r2.47 charset.sgml
*** doc/src/sgml/charset.sgml    27 Dec 2004 22:30:10 -0000    2.47
--- doc/src/sgml/charset.sgml    4 Jan 2005 00:02:40 -0000
***************
*** 189,198 ****
    </sect2>

    <sect2>
!    <title>Benefits</>

     <para>
!     Locale support influences in particular the following features:

      <itemizedlist>
       <listitem>
--- 189,198 ----
    </sect2>

    <sect2>
!    <title>Behavior</>

     <para>
!     Locale support influences the following features:

      <itemizedlist>
       <listitem>
***************
*** 204,209 ****
--- 204,216 ----

       <listitem>
        <para>
+        The ability to use indexes with <literal>LIKE</> clauses
+        <indexterm><primary>LIKE</><secondary>and locales</></indexterm>
+       </para>
+      </listitem>
+
+      <listitem>
+       <para>
         The <function>to_char</> family of functions
        </para>
       </listitem>
***************
*** 211,219 ****
     </para>

     <para>
!     The only severe drawback of using the locale support in
!     <productname>PostgreSQL</> is its speed.  So use locales only if
!     you actually need them.
     </para>
    </sect2>

--- 218,228 ----
     </para>

     <para>
!     The drawback of using locales other than <literal>C</> or
!     <literal>POSIX</> in <productname>PostgreSQL</> is its performance
!     impact. It slows character handling and prevents ordinary indexes
!     from being used by <literal>LIKE</>. For this reason use locales
!     only if you actually need them.
     </para>
    </sect2>

Index: doc/src/sgml/runtime.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v
retrieving revision 1.299
diff -c -c -r1.299 runtime.sgml
*** doc/src/sgml/runtime.sgml    26 Dec 2004 23:06:56 -0000    1.299
--- doc/src/sgml/runtime.sgml    4 Jan 2005 00:02:56 -0000
***************
*** 144,152 ****
     that can be found in <xref linkend="locale">.  The sort order used
     within a particular database cluster is set by
     <command>initdb</command> and cannot be changed later, short of
!    dumping all data, rerunning <command>initdb</command>, and
!    reloading the data. So it's important to make this choice correctly
!    the first time.
    </para>
   </sect1>

--- 144,153 ----
     that can be found in <xref linkend="locale">.  The sort order used
     within a particular database cluster is set by
     <command>initdb</command> and cannot be changed later, short of
!    dumping all data, rerunning <command>initdb</command>, and reloading
!    the data. There is also a performance impact for using locales
!    other than <literal>C</> or <literal>POSIX</>. Therefore, it is
!    important to make this choice correctly the first time.
    </para>
   </sect1>

Index: doc/src/sgml/ref/initdb.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/initdb.sgml,v
retrieving revision 1.32
diff -c -c -r1.32 initdb.sgml
*** doc/src/sgml/ref/initdb.sgml    1 Aug 2004 06:19:18 -0000    1.32
--- doc/src/sgml/ref/initdb.sgml    4 Jan 2005 00:02:57 -0000
***************
*** 54,74 ****
    </para>

    <para>
!    <command>initdb</command> initializes the database cluster's
!    default locale and character set encoding.  Some locale categories
!    are fixed for the lifetime of the cluster, so it is important to
!    make the right choice when running <command>initdb</command>.
!    Other locale categories can be changed later when the server is
!    started.  <command>initdb</command> will write those locale
!    settings into the <filename>postgresql.conf</filename>
!    configuration file so they are the default, but they can be changed
!    by editing that file.  To set the locale that
!    <command>initdb</command> uses, see the description of the
!    <option>--locale</option> option.  The character set encoding can
     be set separately for each database as it is created.
     <command>initdb</command> determines the encoding for the
     <literal>template1</literal> database, which will serve as the
!    default for all other databases.  To alter the default encoding use
     the <option>--encoding</option> option.
    </para>

--- 54,75 ----
    </para>

    <para>
!    <command>initdb</command> initializes the database cluster's default
!    locale and character set encoding. Some locale categories are fixed
!    for the lifetime of the cluster. There is also a performance impact
!    in using locales other than <literal>C</> or <literal>POSIX</>.
!    Therefore it is important to make the right choice when running
!    <command>initdb</command>. Other locale categories can be changed
!    later when the server is started. <command>initdb</command> will
!    write those locale settings into the
!    <filename>postgresql.conf</filename> configuration file so they are
!    the default, but they can be changed by editing that file. To set the
!    locale that <command>initdb</command> uses, see the description of
!    the <option>--locale</option> option. The character set encoding can
     be set separately for each database as it is created.
     <command>initdb</command> determines the encoding for the
     <literal>template1</literal> database, which will serve as the
!    default for all other databases. To alter the default encoding use
     the <option>--encoding</option> option.
    </para>


Re: Doc patch needed: encodings?

From
Peter Eisentraut
Date:
Bruce Momjian wrote:
> I have applied the following patch to mention non-C locales affect
> LIKE.

I don't think it's accurate to say that there is a performance impact.
The only impact is that you need to create a different kind of index.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Doc patch needed: encodings?

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Bruce Momjian wrote:
> > I have applied the following patch to mention non-C locales affect
> > LIKE.
>
> I don't think it's accurate to say that there is a performance impact.
> The only impact is that you need to create a different kind of index.

Well, there is a performance impact to using the locale functions, and
you can't use ordinary indexes for LIKE, just special ones.  Is there
better text I should use?

I used "performance impact" in the reference text to suggest there is an
issue and they might want to read the locale section.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073