Re: langauges, locales, regex, LIKE - Mailing list pgsql-general

From John Sidney-Woollett
Subject Re: langauges, locales, regex, LIKE
Date
Msg-id 40DA83C6.3020203@wardbrook.com
Whole thread Raw
In response to langauges, locales, regex, LIKE  (Dennis Gearon <gearond@fireserve.net>)
Responses Re: langauges, locales, regex, LIKE  (Dennis Gearon <gearond@fireserve.net>)
List pgsql-general
For what it's worth, we have a unicode 7.4.1 database which gives us the
sorting and searching behaviour that we expect (with the exception of
the upper and lower functions). We access the data via jdbc so we don't
have to deal with encoding issues per se as the driver does any
translation for us.

Currently we don't use any LIKE statements, but if we did, and wanted
them optimized then we'd use the appropriate OP Class when defining the
index. We also don't use any REGEX expressions. And we'll shortly be
experimenting with tsearch2...

         List of databases
     Name      |  Owner   | Encoding
---------------+----------+----------
test          | postgres | UNICODE

Setting the psql client encoding to Latin1 and inserting the following
data...

# select * from johntest;
 id | value
----+-------
  1 | test
  2 | tést
  3 | tèst
  4 | taste
  5 | TEST
  6 | TÉST
  7 | TÈST
  8 | TASTE
(8 rows)

and then extracting the data in sorted order works as we would expect

# select * from johntest order by value (no index on the value field)
 id | value
----+-------
  8 | TASTE
  5 | TEST
  7 | TÈST
  6 | TÉST
  4 | taste
  1 | test
  3 | tèst
  2 | tést
(8 rows)

however, applying the UPPER function to the data does not work as
expected, problem with ids 6,7,3,2 - should be ordered (3,7 or 7,3) ,
(6,2 or 2,6)

# select * from johntest order by upper(value);
 id | value
----+-------
  4 | taste
  8 | TASTE
  1 | test
  5 | TEST
  7 | TÈST
  6 | TÉST
  3 | tèst
  2 | tést
(8 rows)

using a LIKE operation also works as expected (again no index on value
field)

# select * from johntest where value like 't%';
 id | value
----+-------
  1 | test
  2 | tést
  3 | tèst
  4 | taste
(4 rows)

Here's our pg_controldata output:
version number:            72
Catalog version number:               200310211
Database cluster state:               in production
pg_control last modified:             Thu 24 Jun 2004 07:18:56 GMT
Current log file ID:                  0
Next log file segment:                29
Latest checkpoint location:           0/1CA5F8D8
Prior checkpoint location:            0/1C8F2074
Latest checkpoint's REDO location:    0/1CA5F8D8
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        17
Latest checkpoint's NextXID:          42355483
Latest checkpoint's NextOID:          29814
Time of latest checkpoint:            Thu 24 Jun 2004 07:18:54 GMT
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

and our locale is

locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

We are intending to support various European languages on our website,
and so far the unicode seems to be working for us.

But maybe we' re just happy in our ignorance!

John Sidney-Woollett

Dennis Gearon wrote:

> If I've read everything right, in order to get:
>
>     multiple languages on a site
>
> with the functionality of ALL of:
>
>     REGEX
>     LIKE
>     Correctly sorted text
>
> A site would have to:
>
>     create a cluster for every language needed
>     run a separate database instance for every language
>     and have the database instances each have their own port
>     and use 8 bit encoding for that specific language
>
> because:
>
>     Sorting is fixed at cluster/directory creation per single
>         database instance
>     And LIKE only works on C Locale with an eight bit encoding
>     and sorting (MAYBE?) works only on 8 bit encoding
>     when using C Locale.
>
> If anyone can correct me on this, I'd love to hear it.
>
> Boy, the old LOCALE system has really got to go someday.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend


pgsql-general by date:

Previous
From: "Dann Corbit"
Date:
Subject: Re: Performance
Next
From: Michael Glaesemann
Date:
Subject: Re: and here is a free OSS library to do ....