Thread: Collation

Collation

From

Moritz Schepp

Date:

11 June 2016, 12:23:09

Hey guys,

I'm trying to adopt an application to use PostgresQL. Its a ruby on rails app and with ActiveRecord, the transition is mostly straight forward. There is one issue though where I got stuck, perhaps you can help:

One aspect of the app is to store international items. For example a table "people" might contain "Albrecht Dürer" and "Paul Cézanne" in the same column. Now, with MySQL, all of the following have at least one result:

SELECT * FROM people WHERE name LIKE '%albrecht%';

SELECT * FROM people WHERE name LIKE '%Anne%';

SELECT * FROM people WHERE name LIKE '%durer%';

SELECT * FROM people WHERE name LIKE '%DÜRER%';

SELECT * FROM people WHERE name LIKE '%cezanne%';

SELECT * FROM people WHERE name LIKE '%Álbrecht%';

I owe that to the collation utf8_general_ci which seems to reasonably fold a wide range of characters.

For PostrgesQL, I'm using the official docker container:

https://hub.docker.com/_/postgres/

and I tried to use the provided hook to essentially run the following bash commands:

echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen

echo "de_DE.UTF-8 UTF-8" >> /etc/locale.gen

locale-gen

AFTER the cluster is created. So I understand that initdb uses en_US.UTF-8 for a locale. I also went ahead and:

CREATE COLLATION german (LOCALE = 'de_DE.utf8');

My problem is that I don't (even) get the german collation to work, the following query yields 'f':

SELECT 'a' COLLATE "german" LIKE 'ä' COLLATE "german";

Shouldn't that be working? Also, the way I understand the docs, there is not really a way to collate with multiple locales on the same field? This would mean that there is no equivalent of utf8_general_ci in PostgresQL?

I hope I'm clear and thanks for any help!

Moritz

Re: Collation

From

Tom Lane

Date:

11 June 2016, 15:43:16

Moritz Schepp <moritz.schepp@gmail.com> writes:
> My problem is that I don't (even) get the german collation to work, the
> following query yields 'f':

> SELECT 'a' COLLATE "german" LIKE 'ä' COLLATE "german";

> Shouldn't that be working?

No.  Postgres interprets COLLATE as having to do only with sort ordering.
What you want here, IIUC, is to do accent-stripping on the data and/or the
LIKE pattern before you compare them.  Take a look at the contrib/unaccent
module for one solution.

            regards, tom lane

Re: Collation

From

Moritz Schepp

Date:

16 June 2016, 06:30:03

Thanks Tom for the quick answer!

that's good to know and the unaccent module seems like it should do the folding. However, doesn't that mean that a specific index would have to be created for every field I want to use this for (unless a text search context would be used)?

Regards,

Moritz

On Sat, Jun 11, 2016 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Moritz Schepp <moritz.schepp@gmail.com> writes:
> My problem is that I don't (even) get the german collation to work, the
> following query yields 'f':

> SELECT 'a' COLLATE "german" LIKE 'ä' COLLATE "german";

> Shouldn't that be working?

No. Postgres interprets COLLATE as having to do only with sort ordering.
What you want here, IIUC, is to do accent-stripping on the data and/or the
LIKE pattern before you compare them. Take a look at the contrib/unaccent
module for one solution.

regards, tom lane

Re: Collation

From

Tom Lane

Date:

16 June 2016, 14:57:54

Moritz Schepp <moritz.schepp@gmail.com> writes:
> that's good to know and the unaccent module seems like it should do the
> folding. However, doesn't that mean that a specific index would have to be
> created for every field I want to use this for (unless a text search
> context would be used)?

I don't see what an index would have to do with this.  You'd probably
use the unaccent() function.

            regards, tom lane