The following documentation comment has been logged on the website:
Page: https://www.postgresql.org/docs/11/textsearch-debugging.html
Description:
It would be helpful if there were some documentation on how to query the
dictionaries themselves to get a canonical root word, either:
1. Directly, such as:
SELECT words FROM english_stem WHERE stem = 'chlorin';
-- should return e.g. "chlorine", "chlorination", "chlorinated"
-- there isn't any documentation on how to actually do this.
2. Indirectly, such as:
SELECT ts_unlexize('english_stem', 'chlorin');
-- this is a function which doesn't yet seem to exist: the one-to-many
-- inverse of ts_lexize().
3. Or, the canonical version of (2):
SELECT ts_canonical('english_stem', 'chlorin');
-- a hypothetical one-to-one function to find the English root word (not the lexeme).
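For comparison, the forward direction that (2) and (3) would invert is
already available through ts_lexize(). A minimal illustration, with the
three sample words taken from the example in (1); the column aliases are
arbitrary, and the output is not shown because it depends on the dictionary
configuration:

```sql
-- Forward direction: word -> array of lexemes, via the built-in
-- english_stem Snowball dictionary.
SELECT word,
       ts_lexize('english_stem', word) AS lexemes
FROM unnest(ARRAY['chlorine', 'chlorination', 'chlorinated']) AS t(word);
```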
An example of where this is useful: consider a set of documents containing
a large amount of English text.
For this example, consider that the following words are frequent: "the",
"kitten", "kittens", "chlorination", "chlorinated", "temperature" and
"something".
We wish to display a "tag cloud" of the most common terms, excluding
stopwords, by means of ts_stat().
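A sketch of the kind of query meant here. The table "documents" and its
text column "body" are hypothetical names used only for illustration;
ts_stat() and to_tsvector() are the standard functions:

```sql
-- Most frequent lexemes across all documents. Stopwords are already
-- dropped by the 'english' text search configuration.
-- "documents" and "body" are assumed names, not part of PostgreSQL.
SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('english', body) FROM documents$$)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 20;
```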
At the moment, it lists:
"kitten"      -- correctly treating "kitten" and "kittens" as the same.
"chlorin"     -- correctly merging "chlorination" and "chlorinated",
                 but creating a non-word.
"temperatur"  -- right stem, not a word.
"someth"      -- stemming error: the -ing suffix has been wrongly removed.
So, given the array ["kitten","chlorin","temperatur","someth"], we wish to
un-stem each entry to the first valid English word that has that stem,
i.e.
["kitten", "chlorine", "temperature", "something"]
Note that it is intentional to retrieve "chlorine" even though the original
inputs were "chlorinated" and "chlorination", and did not necessarily
contain "chlorine".
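The nearest approximation I can think of is a hand-built reverse lookup,
sketched below. This is only a workaround under stated assumptions: the
table "english_words" (a word list that would have to be loaded from an
external source), the view "stem_to_word", and the "documents"/"body"
names are all hypothetical, and "shortest word wins" is an arbitrary rule
for choosing the canonical form.

```sql
-- Hypothetical word list; it must be populated separately, since the
-- Snowball stemmer is algorithmic and has no word list of its own.
CREATE TABLE english_words (word text PRIMARY KEY);

-- One row per stem, keeping the shortest (then alphabetically first)
-- word that stems to it. Stopwords yield an empty lexeme array, so
-- their stem is NULL and they are filtered out.
CREATE MATERIALIZED VIEW stem_to_word AS
SELECT DISTINCT ON (stem) stem, word
FROM (
    SELECT (ts_lexize('english_stem', word))[1] AS stem, word
    FROM english_words
) AS s
WHERE stem IS NOT NULL
ORDER BY stem, length(word), word;

-- Tag cloud with stems mapped back to words where a mapping exists,
-- falling back to the raw stem otherwise.
SELECT coalesce(m.word, s.word) AS tag, s.nentry
FROM ts_stat($$SELECT to_tsvector('english', body) FROM documents$$) AS s
LEFT JOIN stem_to_word AS m ON m.stem = s.word
ORDER BY s.nentry DESC
LIMIT 20;
```

This only returns words present in the loaded list, so it does not answer
the question of whether the dictionaries themselves can be queried.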
There doesn't seem to be any built-in process for doing this. I'm not sure
whether this is just a gap in the documentation, or an RFE for (2). Thanks
very much.