Thread: Collation
Hey guys,
I'm trying to adopt an application to use PostgresQL. Its a ruby on rails app and with ActiveRecord, the transition is mostly straight forward. There is one issue though where I got stuck, perhaps you can help:
One aspect of the app is to store international items. For example a table "people" might contain "Albrecht Dürer" and "Paul Cézanne" in the same column. Now, with MySQL, all of the following have at least one result:
SELECT * FROM people WHERE name LIKE '%albrecht%';
SELECT * FROM people WHERE name LIKE '%Anne%';
SELECT * FROM people WHERE name LIKE '%durer%';
SELECT * FROM people WHERE name LIKE '%DÜRER%';
SELECT * FROM people WHERE name LIKE '%cezanne%';
SELECT * FROM people WHERE name LIKE '%Álbrecht%';
I owe that to the collation utf8_general_ci which seems to reasonably fold a wide range of characters.
For PostrgesQL, I'm using the official docker container:
and I tried to use the provided hook to essentially run the following bash commands:
echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen
echo "de_DE.UTF-8 UTF-8" >> /etc/locale.gen
locale-gen
AFTER the cluster is created. So I understand that initdb uses en_US.UTF-8 for a locale. I also went ahead and:
CREATE COLLATION german (LOCALE = 'de_DE.utf8');
My problem is that I don't (even) get the german collation to work, the following query yields 'f':
SELECT 'a' COLLATE "german" LIKE 'ä' COLLATE "german";
Shouldn't that be working? Also, the way I understand the docs, there is not really a way to collate with multiple locales on the same field? This would mean that there is no equivalent of utf8_general_ci in PostgresQL?
I hope I'm clear and thanks for any help!
Moritz
Moritz Schepp <moritz.schepp@gmail.com> writes: > My problem is that I don't (even) get the german collation to work, the > following query yields 'f': > SELECT 'a' COLLATE "german" LIKE 'ä' COLLATE "german"; > Shouldn't that be working? No. Postgres interprets COLLATE as having to do only with sort ordering. What you want here, IIUC, is to do accent-stripping on the data and/or the LIKE pattern before you compare them. Take a look at the contrib/unaccent module for one solution. regards, tom lane
Thanks Tom for the quick answer!
that's good to know and the unaccent module seems like it should do the folding. However, doesn't that mean that a specific index would have to be created for every field I want to use this for (unless a text search context would be used)?
Regards,
Moritz
On Sat, Jun 11, 2016 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Moritz Schepp <moritz.schepp@gmail.com> writes:
> My problem is that I don't (even) get the german collation to work, the
> following query yields 'f':
> SELECT 'a' COLLATE "german" LIKE 'ä' COLLATE "german";
> Shouldn't that be working?
No. Postgres interprets COLLATE as having to do only with sort ordering.
What you want here, IIUC, is to do accent-stripping on the data and/or the
LIKE pattern before you compare them. Take a look at the contrib/unaccent
module for one solution.
regards, tom lane
Moritz Schepp <moritz.schepp@gmail.com> writes: > that's good to know and the unaccent module seems like it should do the > folding. However, doesn't that mean that a specific index would have to be > created for every field I want to use this for (unless a text search > context would be used)? I don't see what an index would have to do with this. You'd probably use the unaccent() function. regards, tom lane