How to query the underlying dictionary i.e. inverse of ts_lexize() - Mailing list pgsql-docs

From	PG Doc comments form
Subject	How to query the underlying dictionary i.e. inverse of ts_lexize()
Date	April 6, 2019 21:59:27
Msg-id	155457716793.719.16452998626279741513@wrigleys.postgresql.org
List	pgsql-docs

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/11/textsearch-debugging.html
Description:

It would be helpful if there were some documentation on how to query the
dictionaries themselves to get a canonical root word, either:

1. Directly, such as:
  SELECT words FROM english_stem WHERE stem = 'chlorin';
  -- should return e.g. "chlorine", "chlorination", "chlorinated"
  -- there isn't any documentation on how to actually do this.

2. Indirectly, such as:
  SELECT ts_unlexize('english_stem', 'chlorin');
  -- this function doesn't yet seem to exist: the one-to-many
  -- inverse of ts_lexize().

3. Or, the canonical version of (2):
  SELECT ts_canonical('english_stem', 'chlorin');
  -- a one-to-one function to find the English root word (not the lexeme).

An example of where this is useful: consider a list of documents containing
a large amount of English text.
For this example, assume that the following words are frequent: "the",
"kitten", "kittens", "chlorination", "chlorinated", "temperature" and
"something".

We wish to display a "tag cloud" of the most common terms, excluding
stopwords, by means of ts_stat().
At the moment, it lists:
  "kitten"      -- correctly treats "kitten" and "kittens" as the same.
  "chlorin"     -- correctly merges "chlorination" and "chlorinated",
                   but creates a non-word.
  "temperatur"  -- right stem, but not a word.
  "someth"      -- the stemmer has wrongly removed the -ing suffix.
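
For reference, a query of roughly the following shape produces a list like the one above. This is only a sketch: the table name documents and its column body are assumptions made for illustration, and 'english' is assumed to be the text search configuration in use.

```sql
-- Hypothetical table: documents(body text) holds the English text.
-- ts_stat() runs the inner query, which must return a single tsvector
-- column, and reports per-lexeme statistics (word, ndoc, nentry).
SELECT word, ndoc, nentry
FROM ts_stat('SELECT to_tsvector(''english'', body) FROM documents')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```

The word column holds the stemmed lexemes, which is why the tag cloud shows stems rather than dictionary words. Stopwords such as "the" are already dropped by to_tsvector() under the 'english' configuration.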

So, given the array ["kitten","chlorin","temperatur","someth"], we wish to
un-stem each entry to find the first valid English word that has that stem,
i.e.
  ["kitten", "chlorine", "temperature", "something"]
Note that it is intentional to retrieve "chlorine" even though the original
inputs were "chlorinated" and "chlorination", and did not necessarily
contain "chlorine".
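
To make the desired semantics concrete, the lookup could be emulated by running ts_lexize() forward over an external word list and inverting the result. This is a sketch under assumptions, not an existing feature: the tables word_list and stem_lookup are hypothetical, the word list has to be supplied from outside PostgreSQL, and "first valid word" is arbitrarily taken to mean the shortest match.

```sql
-- Hypothetical table: one row per valid English word, loaded from an
-- external word list (the stemmer itself does not provide one).
CREATE TABLE word_list (word text PRIMARY KEY);

-- Stem every word once and keep the (stem, word) pairs for reverse lookup.
-- ts_lexize() returns text[]; the first element is the stem, and the
-- result is NULL here when the word is a stopword.
CREATE TABLE stem_lookup AS
SELECT (ts_lexize('english_stem', word))[1] AS stem, word
FROM word_list;

-- For each requested stem, pick one matching word
-- (shortest first, then alphabetical).
SELECT DISTINCT ON (stem) stem, word
FROM stem_lookup
WHERE stem = ANY (ARRAY['kitten', 'chlorin', 'temperatur', 'someth'])
ORDER BY stem, length(word), word;
```

Dropping DISTINCT ON and the length ordering gives the one-to-many behaviour described in (2); keeping them gives the one-to-one behaviour of (3).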

There doesn't seem to be any process for doing this. Not sure whether this
is just something for the documentation, or an RFE for (2). Thanks very
much.
