Thread: [GENERAL] Text search dictionary vs. the C locale
I am having problems creating an Ispell-based text search dictionary for the Czech language.
Issuing the following command:
create text search dictionary czech_ispell (
template = ispell,
dictfile = czech_ispell,
affFile = czech_ispell
);
ends with
ERROR: syntax error
CONTEXT: line 252 of configuration file "/usr/share/postgresql/9.6/tsearch_data/czech_ispell.affix": " . > TŘIA
The dictionary files are in UTF-8. The database cluster was initialized with
initdb --locale=C --encoding=UTF8
When, on the other hand, I initialize it with
initdb --locale=en_US.UTF8
it works.
I was hoping I could have the C locale with the UTF-8 encoding, but it seems non-ASCII text search dictionaries are not supported in that case. This is a shame, as restoring the dumps goes from 1.5 hours (with the C locale) to 9.5 hours (with en_US.UTF8).
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
Initializing the cluster with
initdb \
  --locale=C \
  --lc-ctype=en_US.UTF-8 \
  --lc-messages=en_US.UTF-8 \
  --lc-monetary=en_US.UTF-8 \
  --lc-numeric=en_US.UTF-8 \
  --lc-time=en_US.UTF-8 \
  --encoding=UTF8
allows me to use my text search dictionary. It now remains to be seen whether index creation will still be fast (I suspect it will) and whether there are any unintended consequences (e.g. in pattern matching, which we use a lot).
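The setup above relies on locale categories being independent of each other: collation (which governs string comparison, and so plausibly index build time) stays at C, while only the character classification category changes. A minimal Python sketch of that independence follows. It uses the standard `locale` module rather than anything from PostgreSQL, the helper names `sort_with_collation` and `try_set_ctype` are made up for illustration, and it assumes the `en_US.UTF-8` locale may or may not be installed on the machine running it.

```python
import functools
import locale


def sort_with_collation(words, collate_locale="C"):
    """Sort words with strcoll() under the given LC_COLLATE setting."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current value
    locale.setlocale(locale.LC_COLLATE, collate_locale)
    try:
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        # Restore whatever collation was active before the call.
        locale.setlocale(locale.LC_COLLATE, previous)


def try_set_ctype(name):
    """Switch only LC_CTYPE; return False if that locale is not installed."""
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False


# Changing LC_CTYPE (when available) leaves C collation untouched.
print("LC_CTYPE switched:", try_set_ctype("en_US.UTF-8"))
print(sort_with_collation(["b", "B", "a"]))  # ['B', 'a', 'b']
```

Under C collation the order is plain code-point order, so the uppercase letter sorts first regardless of what LC_CTYPE is set to.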
twoflower <standa.kurik@gmail.com> writes:
> I am having problems creating an Ispell-based text search dictionary for
> Czech language.
> Issuing the following command:
> create text search dictionary czech_ispell (
> template = ispell,
> dictfile = czech_ispell,
> affFile = czech_ispell
> );
> ends with
> ERROR: syntax error
> CONTEXT: line 252 of configuration file
> "/usr/share/postgresql/9.6/tsearch_data/czech_ispell.affix": " . > TŘIA
> The dictionary files are in UTF-8. The database cluster was initialized with
> initdb --locale=C --encoding=UTF8

Presumably the problem is that the dictionary file parsing functions reject anything that doesn't satisfy t_isalpha() (unless it matches t_isspace()) and in C locale that's not going to accept very much.

I wonder why we're doing it like that. It seems like it'd often be useful to load dictionary files that don't match the database's prevailing locale. Do we really need the t_isalpha tests, or would it be good enough to assume that anything that isn't t_isspace is part of a word?

regards, tom lane
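To illustrate the behaviour described above, here is a rough Python model, not PostgreSQL's actual parser. It assumes that a C-locale alphabetic test only ever accepts ASCII letters, which is modelled with `bytes.isalpha()` on the first UTF-8 byte, while a UTF-8 locale is modelled with the Unicode-aware `str.isalpha()`. The function names are hypothetical.

```python
def is_alpha_c_locale(ch):
    """Model a C-locale isalpha() applied to the first UTF-8 byte of ch."""
    return ch.encode("utf-8")[:1].isalpha()  # bytes.isalpha() is ASCII-only


def is_alpha_utf8_locale(ch):
    """Model a Unicode-aware alphabetic test, as under en_US.UTF-8."""
    return ch.isalpha()


def first_rejected(token, is_alpha):
    """Return the first non-space character the classifier rejects, or None."""
    for ch in token:
        if not ch.isspace() and not is_alpha(ch):
            return ch
    return None


word = "TŘIA"  # the word from the reported affix-file line
print(first_rejected(word, is_alpha_c_locale))     # Ř
print(first_rejected(word, is_alpha_utf8_locale))  # None
```

In this model the ASCII letter T passes in both cases, and the word is rejected at Ř only under the C-locale classifier, which is consistent with the syntax error being reported on that line.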
> On Jul 2, 2017, at 10:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Presumably the problem is that the dictionary file parsing functions
> reject anything that doesn't satisfy t_isalpha() (unless it matches
> t_isspace()) and in C locale that's not going to accept very much.
>
> I wonder why we're doing it like that. It seems like it'd often be
> useful to load dictionary files that don't match the database's
> prevailing locale. Do we really need the t_isalpha tests, or would
> it be good enough to assume that anything that isn't t_isspace is
> part of a word?

What about punctuation?
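The punctuation concern can be made concrete with a small Python sketch. It is only a model of the proposed rule, with hypothetical function names, and it uses `string.punctuation` (ASCII punctuation only) as a stand-in for whatever punctuation test a real parser would apply.

```python
import string


def word_tokens_relaxed(line):
    """Proposed rule: every run of non-whitespace characters is a word."""
    return line.split()


def word_tokens_no_punct(line):
    """Variant that drops tokens made up entirely of ASCII punctuation."""
    return [
        tok for tok in line.split()
        if not all(ch in string.punctuation for ch in tok)
    ]


fragment = ". > TŘIA"  # fragment of the affix-file line from the error
print(word_tokens_relaxed(fragment))   # ['.', '>', 'TŘIA']
print(word_tokens_no_punct(fragment))  # ['TŘIA']
```

Under the relaxed rule the "." and ">" markers are indistinguishable from words, so a parser following it would have to recognise such markers by position or by an explicit check.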
Apologies for truncating the entire body of the replied-to post.